HYPERSAGE: GENERALIZING INDUCTIVE REPRESENTATION LEARNING ON HYPERGRAPHS

Abstract

Graphs are the most common form of structured data representation used in machine learning. They model, however, only pairwise relations between nodes and are not designed for encoding the higher-order relations found in many real-world datasets. To model such complex relations, hypergraphs have proven to be a natural representation. Learning the node representations in a hypergraph is more complex than in a graph, as it involves information propagation at two levels: within every hyperedge and across the hyperedges. Most current approaches first transform a hypergraph structure to a graph for use in existing geometric deep learning algorithms. This transformation leads to information loss and sub-optimal exploitation of the hypergraph's expressive power. We present HyperSAGE, a novel hypergraph learning framework that uses a two-level neural message passing strategy to accurately and efficiently propagate information through hypergraphs. The flexible design of HyperSAGE facilitates different ways of aggregating neighborhood information. Unlike the majority of related work, which is transductive, our approach, inspired by the popular GraphSAGE method, is inductive. Thus, it can also be used on previously unseen nodes, facilitating its use in settings such as evolving or partially observed hypergraphs. Through extensive experimentation, we show that HyperSAGE outperforms state-of-the-art hypergraph learning methods on representative benchmark datasets. We also demonstrate that the higher expressive power of HyperSAGE makes it more stable in learning node representations than the alternatives.

1. INTRODUCTION

Graphs are considered the most prevalent structures for discovering useful information within a network, especially because of their capability to combine object-level information with the underlying inter-object relations (Wu et al., 2020). However, many structures encountered in practical applications form groups and relations that cannot be properly represented using pairwise connections alone, and hence a graph may fail to capture the collective flow of information across objects. In addition, the underlying data structure might be evolving and only partially observed. Such dynamic higher-order relations occur in various domains, such as social networks (Tan et al., 2011), computational chemistry (Gu et al., 2020), neuroscience (Gu et al., 2017) and visual arts (Arya et al., 2019), among others. These relations can be readily represented with hypergraphs, where an edge can connect an arbitrary number of vertices, as opposed to exactly two vertices in graphs. Hypergraphs thus provide a more flexible and natural framework for representing such multi-way relations (Wolf et al., 2016); however, this requires a representation learning technique that exploits the full expressive power of hypergraphs and can generalize to unseen nodes from a partially observed hypergraph. Recent works in the field of geometric deep learning have presented formulations on graph-structured data for the tasks of node classification (Kipf & Welling, 2016), link prediction (Zhang & Chen, 2018), and the classification of graphs (Zhang et al., 2018b). Subsequently, for data containing higher-order relations, a few recent papers have presented hypergraph-based learning approaches for similar tasks (Yadati et al., 2019; Feng et al., 2019). A common implicit premise in these papers is that a hypergraph can be viewed as a specific type of regular graph, and therefore that reducing the hypergraph learning problem to a graph learning problem should suffice.
Strategies to reduce a hypergraph to a graph include transforming the hyperedges into multiple edges using clique expansion (Feng et al., 2019; Jiang et al., 2019; Zhang et al., 2018a), converting to a heterogeneous graph using star expansion (Agarwal et al., 2006), and replacing every hyperedge with an edge created using a certain predefined metric (Yadati et al., 2019). Yet these methods are based on the wrong premise, motivated chiefly by a larger availability of graph-based approaches. By reducing a hypergraph to a regular graph, these approaches make existing graph learning algorithms applicable to hypergraphs. However, hypergraphs are not a special case of regular graphs. The opposite is true: regular graphs are simply a specific type of hypergraph (Berge & Minieka, 1976). Therefore, reducing the hypergraph problem to that of a graph cannot fully utilize the information available in the hypergraph. Two schematic examples outlining this issue are shown in Fig. 1.

Figure 1: (a) Example showing the reduction of a hypergraph to a graph using the clique and star expansion methods. The clique expansion loses the unique information associated with the hyperedge defined by the set of nodes {v_2, v_3}, and cannot distinguish it from the hyperedge defined by the nodes {v_1, v_2, v_3}. Star expansion creates a heterogeneous graph that is difficult to handle using most well-studied graph methods (Hein et al., 2013). (b) Schematic representations of two Fano planes comprising 7 nodes and 7 hyperedges (6 straight lines and 1 circle). The second Fano plane is a copy of the first with nodes v_2 and v_3 permuted. These two hypergraphs cannot be differentiated when transformed to a graph using clique expansion.
To address tasks based on complex structured data, a hypergraph-based formulation is needed that complies with the properties of a hypergraph. A major limitation of the existing hypergraph learning frameworks is their inherently transductive nature: these methods can only predict characteristics of nodes that were present in the hypergraph at training time, and fail to infer on previously unseen nodes. The transductive nature of existing hypergraph approaches makes them inapplicable in, for example, finding the most promising target audience for a marketing campaign, or making movie recommendations when new movies appear all the time. An inductive solution would pave the way to solving such problems using hypergraphs. An inductive learning framework must be able to identify both a node's local role in the hypergraph and its global position (Hamilton et al., 2017). This is important for generalizing the learned node embeddings to a newly observed hypergraph comprising previously unseen nodes, making inductive learning a far more complex problem than transductive learning.

In this paper, we address the above-mentioned limitations of the existing hypergraph learning methods. We propose a simple yet effective inductive learning framework for hypergraphs that is readily applicable to graphs as well. Our approach relies on neural message passing techniques, due to which it can be used on hypergraphs with hyperedges of any cardinality without the need for reduction to graphs. The points below highlight the contributions of this paper:

• We address the challenging problem of representation learning on hypergraphs by proposing HyperSAGE, comprising a message passing scheme which is capable of jointly capturing the intra-relations (within a hyperedge) as well as inter-relations (across hyperedges).
• The proposed hypergraph learning framework is inductive, i.e., it can perform predictions on previously unseen nodes, and can thus be used to model evolving hypergraphs.
• HyperSAGE facilitates neighborhood sampling and provides the flexibility to choose among different ways of aggregating information from the neighborhood.
• HyperSAGE is more stable than state-of-the-art methods, and thus provides more accurate results on node classification tasks on hypergraphs, with reduced variance in the output.

2. RELATED WORK

Learning node representations using graph neural networks has been a popular research topic in the field of geometric deep learning (Bronstein et al., 2017). Graph neural networks can be broadly classified into spatial (message passing) and spectral networks. We focus on the family of spatial message passing graph neural networks that take a graph with some labeled nodes as input and learn embeddings for each node by aggregating information from its neighbors (Xu et al., 2019). Message passing operations in a graph simply propagate information along the edge connecting two nodes. Many variants of such message passing neural networks have been proposed, with popular ones including Gori et al. (2005); Li et al. (2015); Kipf & Welling (2016); Gilmer et al. (2017); Hamilton et al. (2017).

Zhou et al. (2007) introduced learning on hypergraphs to model high-order relations for semi-supervised classification and clustering of nodes. Emulating a graph-based message passing framework for hypergraphs is not straightforward, since a hyperedge involves more than two nodes, which makes the interactions inside each hyperedge more complex. Representing a hypergraph with a matrix makes it rigid in describing the structures of higher-order relations (Li et al., 2013). On the other hand, formulating message passing on a higher-dimensional representation of a hypergraph using tensors makes it computationally expensive and restricts it to only small datasets (Zhang et al., 2019).
Several tensor-based methods do perform learning on hypergraphs (Shashua et al., 2006; Arya et al., 2019); however, they are limited to uniform hypergraphs only. To resolve the above issues, Feng et al. (2019) and Bai et al. (2020) reduce a hypergraph to a graph using clique expansion and perform graph convolutions on it. These approaches cannot utilize the complete structural information in the hypergraph and lead to unreliable learning performance for, e.g., classification, clustering and active learning (Li & Milenkovic, 2017; Chien et al., 2019). Another approach by Yadati et al. (2019), named HyperGCN, replaces a hyperedge with pairwise weighted edges between vertices (called mediators). With the use of mediators, HyperGCN can be interpreted as an improvement over clique expansion and, to the best of our knowledge, is also the state-of-the-art method for hypergraph representation learning. However, in many cases, such as the Fano plane where each hyperedge contains at most three nodes, HyperGCN becomes equivalent to the clique expansion (Dong et al., 2020). In the spectral theory of hypergraphs, methods have been proposed that fully exploit the hypergraph structure using non-linear Laplacian operators (Chan et al., 2018; Hein et al., 2013). In this work, we focus on message passing frameworks. Drawing inspiration from GraphSAGE (Hamilton et al., 2017), we propose to eliminate matrix (or tensor) based formulations in our neural message passing framework, which not only facilitates utilization of all the available information in a hypergraph, but also makes the entire framework inductive in nature.

3. PROPOSED MODEL: HYPERSAGE

The core concept behind our approach is to aggregate feature information from the neighborhood of a node spanning across multiple hyperedges, where the hyperedges can have varying cardinality. Below, we first define some preliminary terms, and then describe our generic aggregation framework. This framework performs message passing at two levels for a hypergraph. Further, for graph-structured data, our framework emulates one-level aggregation similar to GraphSAGE (Hamilton et al., 2017). Our approach inherently allows inductive learning, which makes it also applicable to hypergraphs with unseen nodes.

3.1. PRELIMINARIES

Definition 1 (Hypergraph). A general hypergraph H can be represented as H = (V, E, X), where V = {v_1, v_2, ..., v_N} denotes a set of N nodes (vertices) and E = {e_1, e_2, ..., e_K} denotes a set of hyperedges, with each hyperedge comprising a non-empty subset of V. X ∈ R^{N×d} denotes the feature matrix, such that x_i ∈ X is the feature vector characterizing node v_i ∈ V. The maximum cardinality of the hyperedges in H is denoted by M = max_{e∈E} |e|.

Unlike in a graph, the hyperedges of H can contain different numbers of nodes, and M denotes the largest such number. From the definition above, we see that graphs are a special case of hypergraphs with M = 2. Thus, compared to graphs, hypergraphs are designed to model higher-order relations between nodes. Further, we define three types of neighborhoods in a hypergraph:

Definition 2 (Intra-edge neighborhood). The intra-edge neighborhood of a node v_i ∈ V for any hyperedge e ∈ E is defined as the set of nodes v_j belonging to e, and is denoted by N(v_i, e). Further, let E(v_i) = {e ∈ E | v_i ∈ e} be the set of hyperedges that contain node v_i.

Definition 3 (Inter-edge neighborhood). The inter-edge neighborhood of a node v_i ∈ V, also referred to as its global neighborhood, is defined as the neighborhood of v_i spanning across the set of hyperedges E(v_i), and is represented by N(v_i) = ∪_{e∈E(v_i)} N(v_i, e).

Definition 4 (Condensed neighborhood). The condensed neighborhood of any node v_i ∈ V is a sampled set of α ≤ |e| nodes from a hyperedge e ∈ E(v_i), denoted by N(v_i, e; α) ⊂ N(v_i, e).
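As a concrete illustration of Definitions 1–4, the following minimal Python sketch (our own class and method names, not the authors' code) builds the hypergraph of Fig. 1(a) and computes the three neighborhoods. One detail left open by the text is whether N(v_i, e) includes v_i itself; we exclude it here as a modeling choice.

```python
# Illustrative sketch: a minimal hypergraph container with the intra-edge,
# inter-edge and condensed neighborhoods of Definitions 2-4.
import random

class Hypergraph:
    def __init__(self, hyperedges):
        # hyperedges: dict mapping hyperedge id -> set of node ids
        self.E = {k: set(v) for k, v in hyperedges.items()}

    def edges_of(self, v):
        """E(v_i): the hyperedges that contain node v."""
        return {k for k, e in self.E.items() if v in e}

    def intra_edge_nbrs(self, v, k):
        """N(v_i, e): nodes sharing hyperedge k with v (v itself excluded)."""
        return self.E[k] - {v}

    def inter_edge_nbrs(self, v):
        """N(v_i): union of intra-edge neighborhoods over E(v_i)."""
        nbrs = set()
        for k in self.edges_of(v):
            nbrs |= self.intra_edge_nbrs(v, k)
        return nbrs

    def condensed_nbrs(self, v, k, alpha, rng=random):
        """N(v_i, e; alpha): at most alpha sampled intra-edge neighbors."""
        nbrs = list(self.intra_edge_nbrs(v, k))
        return set(rng.sample(nbrs, min(alpha, len(nbrs))))

# The hypergraph of Fig. 1(a): e1 = {v1, v2, v3}, e2 = {v2, v3}, e3 = {v3, v4, v5}
H = Hypergraph({"e1": {1, 2, 3}, "e2": {2, 3}, "e3": {3, 4, 5}})
print(H.edges_of(3))         # the set {'e1', 'e2', 'e3'} (printed order may vary)
print(H.inter_edge_nbrs(3))  # the set {1, 2, 4, 5} (printed order may vary)
```

Note how v_3 participates in all three hyperedges, so its global neighborhood spans every other node, while the clique expansion of Fig. 1(a) would no longer distinguish e_2 from the pairwise links induced by e_1.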

3.2. GENERALIZED MESSAGE PASSING FRAMEWORK

We propose to interpret the propagation of information in a given hypergraph as a two-level aggregation problem, where the neighborhood of any node is divided into intra-edge neighbors and inter-edge neighbors. For message aggregation, we define the aggregation function F(·) as a permutation-invariant set function on a hypergraph H = (V, E, X) that takes as input a countable unordered message set and outputs a reduced or aggregated message. Further, for two-level aggregation, let F_1(·) and F_2(·) denote the intra-edge and inter-edge aggregation functions, respectively. A schematic representation of the two aggregation functions is provided in Fig. 2. Similar to X, we also define Z as the encoded feature matrix built using the outputs z_i of the aggregation functions. Message passing at node v_i for aggregation of information at the l-th layer can then be stated as

x_{i,l}^{(e)} ← F_1({x_{j,l-1} | v_j ∈ N(v_i, e; α)}),    (1)
x_{i,l} ← x_{i,l-1} + F_2({x_{i,l}^{(e)} | e ∈ E(v_i)}),    (2)

where x_{i,l}^{(e)} refers to the aggregated feature set at v_i obtained with intra-edge aggregation for edge e. The combined two-level message passing is achieved using the nested aggregation function F = F_2 ∘ F_1.

To ensure that the expressive power of a hypergraph is preserved, or at least the loss is minimized, the choice of aggregation function should comply with certain properties. Firstly, the aggregation function should capture the features of neighborhood vertices in a manner that is invariant to the permutation of the nodes and hyperedges. Many graph representation learning methods use permutation-invariant aggregation functions, such as mean, sum and max (Xu et al., 2019). These aggregations have proven to be successful for node classification problems.
For the existing hypergraph frameworks, reduction to simple graphs along with a matrix-based message passing framework limits the possible choices of feature aggregation functions, and hence curtails the potential to explore unique node representations. Secondly, the aggregation function should also preserve the global neighborhood invariance at the 'dominant nodes' of the graph. Here, dominant nodes refer to nodes that contain important features, thereby impacting the learning process relatively more than their neighbors. The aggregation function should ideally be insensitive to whether the provided hypergraph contains a few large hyperedges, or a larger number of smaller ones obtained from splitting them. Generally, a hyperedge would be split in a manner such that the dominant nodes are shared across the resulting hyperedges. In such cases, global neighborhood invariance would imply that the aggregated output at these nodes before and after the splitting of any associated hyperedge stays the same. Otherwise, the learned representation of a node would change significantly with each hyperedge split. Based on these considerations, we define the following properties for a generic message aggregation function that should hold for accurate propagation of information through hypergraphs.

Property 1 (Hypergraph Isomorphic Equivariance). A message aggregation function F(·) is equivariant to hypergraph isomorphism if, for two isomorphic hypergraphs H = (V, E, X) and H* = (V*, E*, X*) with H* = σ ∘ H for a permutation operator σ, the corresponding aggregation outputs satisfy Z* = σ ∘ Z.

Property 2 (Global Neighborhood Invariance). A message aggregation scheme F(·) preserves global neighborhood invariance at a node v_i if its aggregation output z_i remains unchanged under any hyperedge contraction or expansion that leaves the global neighborhood N(v_i) unaltered.

The flexibility of our message passing framework allows us to go beyond simple aggregation functions on hypergraphs without violating Property 1. We introduce a series of power mean functions as aggregators, which have recently been shown to generalize well on graphs (Li et al., 2020). We perform message aggregation in hypergraphs using these generalized means, denoted by M_p, and provide a study of their performance in Section 4.2. We also show that, with appropriate combinations of the intra-edge and inter-edge aggregations, Property 2 is also satisfied. This property ensures that the representation of a node after message passing is invariant to the cardinality of the hyperedges, i.e., the aggregation scheme is not sensitive to hyperedge contraction or expansion, as long as the global neighborhood of the node remains the same in the hypergraph.

Aggregation Functions. One major advantage of our strategy is that the message passing module is decoupled from the choice of the aggregation itself. This allows our approach to be used with a broad set of aggregation functions. We discuss below a few such possible choices.

Generalized means. Also referred to as power means, this class of functions is commonly used to obtain an aggregated measure over a given set of samples. Mathematically, generalized means can be expressed as

M_p = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^p \right)^{1/p},

where n refers to the number of samples in the aggregation and p denotes its power. The choice of p allows different interpretations of the aggregation function. For example, p = 1 denotes arithmetic mean aggregation, p = 2 refers to the mean squared estimate, and a large value of p corresponds to max pooling from the group. Similarly, M_p can be used for geometric and harmonic means with p → 0 and p = -1, respectively. Similar to the recent work of Li et al. (2020), we use generalized means for intra-edge as well as inter-edge aggregation. The two functions F_1(·) and F_2(·) for aggregation at node v_i are defined as

F_1^{(i)}(s) = \left( \frac{1}{|N(v_i,e)| \, |N(v_i)|} \sum_{v_j \in N(v_i,e)} \left( \sum_{m=1}^{|E(v_i)|} \frac{1}{|N(v_i,e_m)|} \right)^{-1} x_j^p \right)^{1/p}    (3)

F_2^{(i)}(s) = \left( \frac{1}{|E(v_i)|} \sum_{e \in E(v_i)} \left( F_1(s) \right)^p \right)^{1/p}    (4)

where we use 's' as a concise representation of the unordered input set, as in Eq. 1.
Here and henceforth in this paper, we drop the superscript index '(i)' for the sake of clarity; further occurrences of the two aggregation functions should be interpreted with respect to node v_i. Note that in Eq. 3 and Eq. 4, we choose the power term p to be the same for F_1 and F_2 so as to satisfy the global neighborhood invariance stated in Property 2. The scaling term in F_1 balances the bias introduced in intra-edge aggregation by the varying cardinality across hyperedges. Together, these restrictions ensure that the joint aggregation F = F_2 ∘ F_1 satisfies the property of global neighborhood invariance at all times. A proof that the two aggregations satisfy Property 2 is given in Appendix B.

Sampling-based Aggregation. Our neural message passing scheme provides the flexibility to fit the message aggregation module to a desired computational budget by aggregating information from only a subset N(v_i, e; α) of the full neighborhood N(v_i, e), if needed. We propose to apply sub-sampling only to the nodes from the training set, and to use information from the full neighborhood for the test set. The advantages of this are twofold. First, using fewer samples per aggregation at training time reduces the relative computational burden. Second, similar to dropout (Srivastava et al., 2014), it adds regularization to the optimization process. Using the full neighborhood on test data avoids randomness in the test predictions and generates consistent output.
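To make the aggregators concrete, here is a small NumPy sketch (ours, not the authors' code) of the generalized mean M_p and of a plain nested aggregation F_2 ∘ F_1. For simplicity it omits the cardinality-balancing scaling of Eq. 3, and it assumes non-negative features (as with the normalized bag-of-words features used in the experiments), since fractional powers of negative entries are undefined.

```python
# Sketch of power-mean aggregation and a simplified two-level F2 ∘ F1.
import numpy as np

def power_mean(xs, p):
    """Generalized mean M_p over a set of scalars or vectors (rows of xs)."""
    xs = np.asarray(xs, dtype=float)
    return np.mean(xs ** p, axis=0) ** (1.0 / p)

# p = 1: arithmetic mean; p = -1: harmonic mean; large p approaches max
x = [1.0, 2.0, 4.0]
print(power_mean(x, 1))    # ≈ 2.333 (arithmetic mean)
print(power_mean(x, -1))   # ≈ 1.714 (harmonic mean, 12/7)
print(power_mean(x, 50))   # ≈ 3.9, approaching max(x) = 4

def two_level(node, edges, X, p):
    """F = F2 ∘ F1 without the paper's cardinality-balancing scaling:
    power-mean within each hyperedge, then power-mean across hyperedges.
    edges: hyperedges (sets of node ids) containing `node`; X: id -> feature."""
    per_edge = [power_mean([X[v] for v in e - {node}], p) for e in edges]
    return power_mean(per_edge, p)
```

The same p is used at both levels, mirroring the restriction p_1 = p_2 discussed above; the omitted scaling term only reweights the inner sum and does not change this structure.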

3.3. INDUCTIVE LEARNING ON HYPERGRAPHS

HyperSAGE is a general framework for learning node representations on hypergraphs, even on unseen nodes. Our approach uses a neural network comprising L layers, and feature aggregation is performed at each of these layers as well as across the hyperedges.

Algorithm 1 HyperSAGE Message Passing

Input: H = (V, E, X); depth L; weight matrices W^l for l = 1, ..., L; non-linearity σ; intra-edge aggregation function F_1(·); inter-edge aggregation function F_2(·)
Output: node embeddings z_i for all v_i ∈ V

h_i^0 ← x_i ∈ X, for all v_i ∈ V
for l = 1, ..., L do
    for e ∈ E do
        h_i^l ← h_i^{l-1}, for all v_i ∈ e
        for v_i ∈ e do
            h_i^l ← h_i^l + F_2^{(i)}(s)
        end
    end
    h_i^l ← σ( W^l · h_i^l / ||h_i^l||_2 ), for all v_i ∈ V
end
z_i ← h_i^L, for all v_i ∈ V

Algorithm 1 describes the forward propagation mechanism, which implements the nested aggregation function F = F_2 ∘ F_1 described above. At each iteration, nodes first aggregate information from their neighbors within a specific hyperedge. This is repeated over all the hyperedges and across all the L layers of the network. The trainable weight matrices W^l, l = 1, ..., L, aggregate information across the feature dimension and propagate it through the layers of the network.
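A minimal NumPy sketch of this forward pass (our own reading, not the authors' implementation): we instantiate the aggregations as arithmetic means (M_p with p = 1), take σ to be ReLU, and fold in the condensed-neighborhood sampling of Definition 4 via the optional `alpha` and `rng` arguments. All names and shapes are our assumptions.

```python
# Sketch of the Algorithm 1 forward pass with mean aggregation and ReLU.
import numpy as np

def hypersage_forward(X, edges, Ws, alpha=None, rng=None):
    """X: (N, d) feature matrix; edges: list of sets of node indices;
    Ws: one (d_in, d_out) weight matrix per layer."""
    H = X.copy()
    for W in Ws:
        Hnew = H.copy()                      # residual: h_i^l starts from h_i^{l-1}
        for e in edges:
            for i in e:
                nbrs = sorted(e - {i})
                if alpha is not None and rng is not None and len(nbrs) > alpha:
                    nbrs = rng.sample(nbrs, alpha)   # condensed neighborhood
                if nbrs:
                    # intra-edge mean of previous-layer features, summed over edges
                    Hnew[i] = Hnew[i] + H[nbrs].mean(axis=0)
        norms = np.linalg.norm(Hnew, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        H = np.maximum((Hnew / norms) @ W, 0.0)      # L2 normalize, W^l, ReLU
    return H

# toy example: 3 nodes, 2 hyperedges, one identity-weight layer
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Z = hypersage_forward(X, [{0, 1, 2}, {1, 2}], [np.eye(2)])
print(Z.shape)    # (3, 2)
```

A trained model would learn the `Ws` by backpropagation through this pass; the sketch only shows the aggregation and normalization structure.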

Generalizability of HyperSAGE.

HyperSAGE can be interpreted as a generalized formulation that unifies various existing graph-based as well as hypergraph-based formulations, identifying each of them as a special variant or case of our method. We briefly discuss two popular algorithms here.

Graph Convolution Networks (GCN).

The GCN approach proposed by Kipf & Welling (2016) is a graph-based method that can be derived as a special case of HyperSAGE by setting the maximum cardinality M = 2 and the aggregation function F_2 = M_p with p = 1. This being a graph-based method, F_1 is not used.

GraphSAGE. Our approach, when reduced for graphs using M = 2, is similar to GraphSAGE. For an exact match, the aggregation function F_2 should be one of mean, max or LSTM. Further, the sampling term α can be adjusted to match the number of samples per aggregation as in GraphSAGE.
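A quick numeric check of this reduction claim (our own sketch, with hypothetical helper names): when every hyperedge is a pair (M = 2) and p = 1, the intra-edge aggregate over the single neighbor is just that neighbor's feature, so the two-level scheme collapses to a plain neighbor mean, matching GraphSAGE with mean aggregation.

```python
# When all hyperedges are pairs, F2 ∘ F1 with p = 1 equals a neighbor mean.
import numpy as np

def two_level_mean(node, edges, X):
    per_edge = [np.mean([X[v] for v in e - {node}], axis=0)
                for e in edges if node in e]
    return np.mean(per_edge, axis=0)

X = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 2.0]), 2: np.array([3.0, 1.0])}
pairs = [{0, 1}, {0, 2}]                    # a graph: edges (0,1) and (0,2)
agg = two_level_mean(0, pairs, X)
graphsage_mean = np.mean([X[1], X[2]], axis=0)
print(np.allclose(agg, graphsage_mean))     # True
```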

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

For the experiments in this paper, we use co-citation and co-authorship network datasets: CiteSeer, PubMed, Cora (Sen et al., 2008) and DBLP (Rossi & Ahmed, 2015). The task for each dataset is to predict the topic to which a document belongs (multi-class classification). For these datasets, x_i corresponds to a bag of words, such that x_{i,j} ∈ x_i represents the normalized frequency of occurrence of the j-th word. Additional details related to the hypergraph topology are presented in the Appendix.

4.2. SEMI-SUPERVISED NODE CLASSIFICATION ON HYPERGRAPHS

Performance comparison with existing methods. We implemented HyperSAGE for the task of semi-supervised classification of nodes on a hypergraph, and compare the results with state-of-the-art methods. These include (a) a multi-layer perceptron with explicit hypergraph Laplacian regularisation (MLP + HLR), (b) Hypergraph Neural Networks (HGNN) (Feng et al., 2019), which uses a clique expansion, and (c) HyperGCN and its variants (Yadati et al., 2019), which collapse the hyperedges using mediators. For the HyperSAGE method, we use 4 variants of generalized means M_p, with p = 1, 2, -1 and 0.01, using the complete neighborhood, i.e., α = |e|. For all cases, 10 data splits over 8 random weight initializations are used, totalling 80 experiments per method for every dataset. The data splits are the same as in HyperGCN, described in Appendix A.1. Table 1 shows the results obtained for the node classification task. We see that the different variants of HyperSAGE consistently show better scores across our benchmark datasets, except on Cora co-citation, where no improvement is observed compared to HGNN. The Cora co-citation data is relatively small in size, with a cardinality of 3.0 ± 1.1, and we speculate that there is not enough scope for improving with HyperSAGE beyond what HGNN can express with the clique expansion. For the larger datasets, DBLP and Pubmed, the improvements obtained with HyperSAGE over the best baselines are 6.3% and 4.3%, respectively. Apart from its superior performance, HyperSAGE is also stable, and is less sensitive to the choice of data split and initialization of the weights. This is evident from the standard deviation (SD) scores for the various experiments in Table 1. We see that the SD scores for our method are lower than those of other methods, and there is a significant gain in performance compared to HyperGCN.
Another observation is that the HyperGCN method is very sensitive to the data splits as well as the initializations, with very large errors in the predictions. This is even more pronounced for the FastHyperGCN variant. Also, we found that all 4 choices of p work well with HyperSAGE on these datasets. We further perform a more comprehensive study analyzing the effect of p on model performance later in this section.

Stability analysis. We further study the stability of our method in terms of the variance observed in performance for different ratios of train and test splits, and compare the results with those of HyperGCN implemented under similar settings. Fig. 3 shows results for the two learning methods on 5 different train-test ratios. We see that the performance of both models improves when a higher fraction of data is used for training, and the performances are approximately the same at a train-test ratio of 1/3. However, for smaller ratios, HyperSAGE outperforms HyperGCN by a significant margin across all datasets. Further, the standard deviations of the predictions of HyperSAGE are significantly lower than those of HyperGCN. Clearly, this implies that HyperSAGE is able to better exploit the information contained in the hypergraph, and can thus produce more accurate and stable predictions. Results on Cora and Citeseer can be found in Appendix C.

Effect of generalized mean aggregations and neighborhood sampling. We study here the effect of different choices of the aggregation functions F_1(·) and F_2(·) on the performance of the model. Further, we also analyze how the number of samples chosen for aggregation affects performance. Aggregation functions M_p are chosen with p = 1, 2, 3, 4, 5, 0.01 and -1, and to comply with global neighborhood invariance, we use the aggregation function as in Eq. 4.
The numbers of neighbors α for intra-edge aggregation are chosen to be 2, 3, 5 and 10. Table 2 shows the accuracy scores obtained for different choices of p and α on the DBLP and Pubmed datasets. In most cases, higher values of p reduce the performance of the model. For α = 2 on DBLP, performance seems to be independent of the choice of p. A possible explanation is that the number of neighbors is very small, so a change in p does not significantly affect the propagation of information. An exception is p = -1, where the performance drops in all cases. For Pubmed, the choice of p seems to be very important, and we find that p = 0.01 fits best. We also see that the number of samples per aggregation can significantly affect the performance of the model. For DBLP, model performance increases with increasing α. However, for Pubmed, performance improves up to α = 5, after which a slight drop is observed for larger sets of neighbors. Note that for Pubmed, the majority of the hyperedges have cardinality less than or equal to 10. This means that during aggregation, information will most often be aggregated from all the neighbors, involving almost no stochastic sampling. Stochastic sampling of nodes can serve as a regularization mechanism and reduce the impact of noisy hyperedges. However, at α = 10 it is almost absent, due to which noise in the data affects the performance of the model, which is not the case for DBLP.

4.3. INDUCTIVE LEARNING ON EVOLVING GRAPHS

For inductive learning experiment, we consider the case of evolving hypergraphs. We create 4 inductive learning datasets from DBLP, Pubmed, Citeseer and Core (co-citation) by splitting each of the datasets into a train-test ratio of 1:4. Further, the test data is split into two halves: seen and unseen. The seen test set comprises nodes that are part of the hypergraph used for representation learning. Further, unseen nodes refer to those that are never a part of the hypergraph during training. To study how well HyperSAGE generalizes for inductive learning, we classify the unseen nodes and compare the performance with the scores obtained on the seen nodes. Further, we also compare our results on unseen nodes with those of MLP+HLR. The results are shown in Table 3 . We see that results obtained with HyperSAGE on unseen nodes are significantly better than the baseline method. Further, these results seem to not differ drastically from those obtained on the seen nodes, thereby confirming that HyperSAGE can work with evolving graphs as well. b) Hyperedge e q is split into r hyperedges to reduce the cardinality of e q . Note that the global neighborhood of v i still remains the same, however its intra-edge neighborhood has changed due to such splitting. v i v 𝛄1 e 1 e 2 e q e 3 v 𝛄2 v 𝛄3 v 𝛄5 v 𝛄4 v 𝛄6 v 𝛄7 v 𝛄9 v 𝛄10 (a) F 2 (s) =    1 |E(v i )| e∈E(vi)   1 |N(v i , e)| vj ∈N(vi,e) x p1 j   p 2 p 1    1 p 2 (6) This equation can be rewritten as F2(s) =    1 |E(vi)|      1 |N(vi, eq)| v j ∈N(v i ,eq ) x p 1 j   p 2 p 1 + e∈E(v i ),e =eq   1 |N(vi, e)| v j ∈N(v i ,e) x p 1 j   p 2 p 1       1 p 2 (7) Further, let Ψ = e∈E(vi),e =eq   1 |N(v i , e)| vj ∈N(vi,e) x p1 j   p 2 p 1 , then Eq. 7 can be rewritten as F 2 (s) =    1 |E(v i )|      1 |N(v i , e q )| vj ∈N(vi,eq) x p1 j   p 2 p 1 + Ψ       1 p 2 (9) Let us assume now that hyperedge e q is split into r hyperedges given by E(v i , e q ) = {e q1 , e q2 . . . e qr }. 
Denoting the aggregation on this new set of hyperedges by \tilde{F}_2(s), we assemble the contribution from the new hyperedges with added weight terms w_j as

\tilde{F}_2(s) = \left( \frac{1}{|E(v_i)|} \left[ \sum_{e \in E(v_i, e_q)} \left( \frac{1}{|N(v_i, e)|} \sum_{v_j \in N(v_i, e)} w_j x_j^{p_1} \right)^{\frac{p_2}{p_1}} + \Psi \right] \right)^{\frac{1}{p_2}} \quad (10)

For the property of global neighborhood invariance to hold at v_i, the condition F_2(v_i) = \tilde{F}_2(v_i) should be satisfied. Based on this, we would like to solve for the weights w_j. Equating the two expressions, we obtain

\left( \frac{1}{|N(v_i, e_q)|} \sum_{v_j \in N(v_i, e_q)} x_j^{p_1} \right)^{\frac{p_2}{p_1}} = \sum_{e \in E(v_i, e_q)} \left( \frac{1}{|N(v_i, e)|} \sum_{v_j \in N(v_i, e)} w_j x_j^{p_1} \right)^{\frac{p_2}{p_1}} \quad (11)

We further solve for the variables p_1, p_2 and w_j for which Eq. 11 holds. For the sake of clarity, we first simplify Eq. 11 using the following substitutions: \alpha = \frac{p_2}{p_1}, \beta = \frac{1}{|N(v_i, e_q)|} and \beta_{mj} = \frac{w_j}{|N(v_i, e_m)|}, where the index m refers to the m-th of the r hyperedges obtained by splitting e_q. Further, let z_j = x_j^{p_1} for v_j \in N(v_i, e_q), and z_{mj} = x_j^{p_1} for v_j \in N(v_i, e_m) with e_m \in E(v_i, e_q). Based on these substitutions, Eq. 11 can be restated as

\beta^{\alpha} (z_1 + z_2 + \ldots + z_N)^{\alpha} = (\beta_{11} z_1 + \beta_{12} z_2 + \ldots + \beta_{1N} z_N)^{\alpha} + (\beta_{21} z_1 + \beta_{22} z_2 + \ldots + \beta_{2N} z_N)^{\alpha} + \ldots + (\beta_{r1} z_1 + \beta_{r2} z_2 + \ldots + \beta_{rN} z_N)^{\alpha} \quad (12)

We seek general solutions for w_j and \alpha that hold for all values of z_j \in [0, 1], since every element of the normalized feature vectors x_j lies in [0, 1]. For a general solution, the coefficient of each z_j on the right should equal the coefficient of z_j on the left. The term on the left can be reformulated as

\beta^{\alpha} (z_1 + z_2 + \ldots + z_N)^{\alpha} = \beta^{\alpha} \left( z_1 + (z_2 + z_3 + \ldots + z_N) \right)^{\alpha} \quad (13)

Consider the case |z_1| \leq |z_2 + z_3 + \ldots + z_N|. Expanding Eq. 13 using the binomial expansion for real coefficients gives

\beta^{\alpha} \left( z_1 + (z_2 + \ldots + z_N) \right)^{\alpha} = \beta^{\alpha} \left( \binom{\alpha}{0} z_1^{\alpha} + \binom{\alpha}{1} z_1^{\alpha - 1} (z_2 + z_3 + \ldots + z_N) + \ldots + \alpha z_1 (z_2 + \ldots + z_N)^{\alpha - 1} + (z_2 + \ldots + z_N)^{\alpha} \right) \quad (14)

Without any loss of generality, we consider a splitting of hyperedge e_q into r hyperedges such that nodes v_{\gamma_1} and v_{\gamma_2} are no longer contained in the same hyperedge. This implies that the RHS of Eq. 14 should not contain product terms of z_1 and z_2. Hence, the term z_1^{\alpha - 1} z_2 must vanish, which requires

\alpha - 1 = 0 \;\Rightarrow\; \alpha = 1 \;\Rightarrow\; p_1 = p_2 \quad (15)

Putting \alpha = 1 and comparing the coefficients of each z_j in Eq. 12 (where \beta_{mj} = 0 whenever v_j \notin N(v_i, e_m)), we get

\beta = \beta_{1j} + \beta_{2j} + \ldots + \beta_{rj} \;\Rightarrow\; \frac{1}{|N(v_i, e_q)|} = \frac{w_j}{|N(v_i, e_m)|} \quad \text{for the } e_m \text{ containing } v_j \quad (16)

Thus, if an edge e_q is split into multiple edges E(v_i, e_q), then for the two aggregations to be equal, the conditions are p_1 = p_2 and w_j = |N(v_i, e)| / |N(v_i, e_q)| for all v_j \in N(v_i, e) and e \in E(v_i, e_q). While we provide above a description related to splitting a certain hyperedge e_q into r hyperedges, the derived results can also be used to compute the global neighborhood aggregation at any given node v_i. Similar to e_q above, node v_i together with its global neighborhood N(v_i) can be interpreted as a virtual hyperedge that has been split into the hyperedges that actually exist and contain v_i. These resulting hyperedges are equivalent to the r hyperedges obtained after splitting, as stated above.
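As a sanity check, the derived condition can be verified numerically. The sketch below uses toy features invented for illustration; it sets p_1 = p_2 = 1 and w_j = |N(v_i, e)| / |N(v_i, e_q)|, and confirms that the contribution of e_q in Eq. 9 equals the weighted contribution of its splits in Eq. 10.

```python
# Numerical check of global neighborhood invariance under hyperedge
# splitting, with p1 = p2 = 1 and weights w_j = |N(v_i,e)| / |N(v_i,e_q)|.
# Toy features; e_q = {1, 2, 3, 4} is split into {1, 2} and {3, 4}.

x = {1: 0.1, 2: 0.3, 3: 0.5, 4: 0.7}

def mean(vals):
    return sum(vals) / len(vals)

# Before the split: one hyperedge e_q with N(v_i, e_q) = [1, 2, 3, 4].
before = mean([x[j] for j in [1, 2, 3, 4]])   # the e_q term in Eq. 9

# After the split: contributions of e_q1 and e_q2, each node weighted
# by w_j = |N(v_i, e)| / |N(v_i, e_q)|.
n_q = 4
split = [[1, 2], [3, 4]]
after = sum(mean([(len(e) / n_q) * x[j] for j in e]) for e in split)

assert abs(before - after) < 1e-9   # the e_q contribution is unchanged
```

With the weights in place, each split edge contributes a mean of re-scaled features whose sum reproduces the original mean over e_q, which is exactly the equality stated in Eq. 11.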



CONCLUSION

We have proposed HyperSAGE, a generic neural message passing framework for inductive learning on hypergraphs. The proposed approach fully utilizes the inherent higher-order relations in a hypergraph structure without reducing it to a regular graph. Through experiments on several representative datasets, we have shown that HyperSAGE outperforms the other methods for hypergraph learning. Several variants of graph-based learning algorithms, such as GCN and GraphSAGE, can be derived from its flexible aggregation and neighborhood sampling framework, making HyperSAGE a universal framework for learning node representations on hypergraphs as well as graphs.

HyperGCN implementation: https://github.com/malllabiisc/HyperGCN



Figure 2: Schematic representation of the two-level message passing scheme of HyperSAGE, with aggregation functions F_1(·) and F_2(·). It shows information aggregation from two hyperedges e_A and e_B, where the intra-edge aggregation is performed on sampled sets of 5 nodes (α = 5) for each hyperedge. For node v_i, x_i and z_i denote the input and encoded feature vectors, respectively.
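The two-level scheme in the figure can be sketched in a few lines of Python. This is a simplified illustration with made-up scalar features and neighborhoods (the actual model operates on feature vectors with learned transformations); `gen_mean` plays the role of the generalized mean M_p underlying both aggregation functions.

```python
# Simplified sketch of HyperSAGE's two-level aggregation at a node v_i:
# F1 aggregates within each hyperedge, F2 aggregates across hyperedges.
# Toy scalar features; the real model uses vectors and learned weights.

def gen_mean(values, p):
    """Generalized mean M_p of non-negative scalars."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def f1(edge_nbrs, x, p1):
    """Intra-edge aggregation over v_i's neighbors inside one hyperedge."""
    return gen_mean([x[j] for j in edge_nbrs], p1)

def f2(all_nbrs, x, p1, p2):
    """Inter-edge aggregation over the per-hyperedge F1 outputs."""
    return gen_mean([f1(nbrs, x, p1) for nbrs in all_nbrs], p2)

x = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8}   # toy node features
nbrs_vi = [[1, 2], [2, 3, 4]]          # N(v_i, e) for hyperedges e_A, e_B

# With p1 = p2 = 1 both levels reduce to plain means:
z_vi = f2(nbrs_vi, x, p1=1, p2=1)      # mean(0.3, 0.6) = 0.45
```

Other choices of p_1 and p_2 recover different aggregators (e.g. large p approaches a max-style aggregation), which is the flexibility the framework exposes.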

Figure 3: Accuracy scores of HyperSAGE and HyperGCN for different train-test ratios on the multi-class classification datasets.

Figure 4: (a) Example showing node v_i shared across 4 hyperedges. (b) Hyperedge e_q is split into r hyperedges to reduce the cardinality of e_q. Note that the global neighborhood of v_i remains the same; however, its intra-edge neighborhood has changed due to the splitting.




Property 1 (Hypergraph Isomorphic Equivariance). A message aggregation scheme F(·) is equivariant to hypergraph isomorphism if, for an isomorphic hypergraph H* = σ • H, where Z and Z* represent the encoded feature matrices obtained using F(·) on H and H*, the condition Z* = σ • Z holds. Here, σ denotes a permutation operator on hypergraphs.

Property 2 (Global Neighborhood Invariance). A message aggregation scheme F(·) satisfies global neighborhood invariance at any node v_i ∈ V for a given hypergraph H = (V, E, X) if, for any operation Γ(·) such that H* = Γ(H), where z_i and z_i* denote the encoded feature vectors obtained using F(·) at node v_i on H and H*, the condition z_i* = z_i holds. Here, Γ(H) could refer to operations such as hyperedge contraction or expansion.
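Property 1 can be illustrated with a minimal numerical sketch. The aggregation below is a plain one-step neighborhood mean (a stand-in for F(·), not the actual HyperSAGE operator), and the data is invented for illustration: relabeling the nodes permutes the encoded features in exactly the same way, i.e., Z* = σ • Z.

```python
# Toy illustration of Property 1 (isomorphic equivariance) for a simple
# one-step, mean-based aggregation F. Hypothetical data, not from the paper.

def encode(edges, x):
    """For each node, average the features of its hyperedge co-members."""
    z = {}
    for v in x:
        nbrs = [u for e in edges if v in e for u in e if u != v]
        z[v] = sum(x[u] for u in nbrs) / len(nbrs) if nbrs else x[v]
    return z

x = {0: 1.0, 1: 2.0, 2: 4.0}
edges = [{0, 1, 2}, {1, 2}]

sigma = {0: 2, 1: 0, 2: 1}                        # a node permutation
x_p = {sigma[v]: f for v, f in x.items()}         # permuted features
edges_p = [{sigma[v] for v in e} for e in edges]  # permuted hyperedges

z, z_p = encode(edges, x), encode(edges_p, x_p)
assert all(abs(z_p[sigma[v]] - z[v]) < 1e-12 for v in x)  # Z* = sigma . Z
```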

Table 1: Performance of HyperSAGE and other hypergraph learning methods on co-authorship and co-citation datasets.

For all experiments, we use a neural network with 2 layers. All models are implemented in PyTorch and trained using the Adam optimizer. See Appendix A.2 for implementation details.

Table 2: Performance of HyperSAGE for multiple values of p in the generalized means aggregator (M_p) with varying numbers of neighborhood samples (α).
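As context for the caption above: the generalized mean M_p(s) = ((1/|s|) Σ_{x∈s} x^p)^{1/p} interpolates between familiar aggregators as p varies, which is what this ablation sweeps. A small sketch with toy values:

```python
# Behaviour of the generalized mean M_p for different p on toy values:
# p = 1 gives the arithmetic mean, p = -1 the harmonic mean, and
# large p pushes the aggregate towards the maximum element.

def m_p(values, p):
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

s = [0.2, 0.4, 0.8]
arith = m_p(s, 1)       # (0.2 + 0.4 + 0.8) / 3, about 0.467
harm = m_p(s, -1)       # harmonic mean, dominated by the small values
big_p = m_p(s, 100)     # approaches max(s) = 0.8
```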

Table 3: Performance of HyperSAGE and its variants on nodes which were part of the training hypergraph (seen) and nodes which were not part of the training hypergraph (unseen).


APPENDICES

A EXPERIMENTS: ADDITIONAL DETAILS

We perform multi-class classification on co-authorship and co-citation datasets, where the task is to predict the topic (class) of each document.

A.1 DATASET DESCRIPTION

Hypergraphs are created on these datasets by assigning each document to a node; each hyperedge represents (a) all documents co-authored by an author in the co-authorship datasets, and (b) all documents cited together by a document in the co-citation datasets. Each document (node) is represented by bag-of-words features. The details of nodes, hyperedges and features are shown in Table 4. We use the same datasets and train-test splits as provided by Yadati et al. (2019) in their publicly available implementation 1. We use the following hyperparameter settings:
• hidden layer size: 32
• dropout rate: 0.5
• learning rate: 0.01
• weight decay: 0.0005
• number of training epochs: 150
• λ for explicit Laplacian regularisation: 0.001

B CHOICE OF INTER-EDGE AND INTRA-EDGE AGGREGATIONS

Proof. For any given hypergraph H_1 = (V, E_1, X), let v_i denote a node at which global neighborhood invariance holds. The intra-edge aggregation output F_1(s) at v_i can then be written using the generalized means M_p as

F_1(s) = \left( \frac{1}{|N(v_i, e)|} \sum_{v_j \in N(v_i, e)} x_j^{p_1} \right)^{\frac{1}{p_1}} \quad (5)

To reiterate, s denotes the unordered set of inputs, as shown in Eq. 5. Further, the inter-edge aggregation F_2(·) can be stated as in Eq. 6.

