EQUIVARIANT HYPERGRAPH DIFFUSION NEURAL OPERATORS

Abstract

Hypergraph neural networks (HNNs), which use neural networks to encode hypergraphs, provide a promising way to model higher-order relations in data and to solve prediction tasks built upon such relations. However, higher-order relations in practice contain complex patterns and are often highly irregular, so it is challenging to design an HNN that is expressive enough to capture those relations while remaining computationally efficient. Inspired by hypergraph diffusion algorithms, this work proposes a new HNN architecture named ED-HNN, which provably approximates any continuous equivariant hypergraph diffusion operator, a class that can model a wide range of higher-order relations. ED-HNN can be implemented efficiently by combining star expansions of hypergraphs with standard message passing neural networks. ED-HNN further shows great superiority in processing heterophilic hypergraphs and constructing deep models. We evaluate ED-HNN for node classification on nine real-world hypergraph datasets. ED-HNN uniformly outperforms the best baselines over these nine datasets and improves prediction accuracy by more than 2% on four of them. Our code is available at: https://github.com/Graph-COM/ED-HNN.

1. INTRODUCTION

Machine learning on graphs has recently attracted great attention in the community due to the ubiquity of graph-structured data and the associated inference and prediction problems (Zhu, 2005; Hamilton, 2020; Nickel et al., 2015). Current works primarily focus on graphs, which can model only pairwise relations in data. Emerging research has shown that higher-order relations that involve more than two entities often reveal more significant information in many applications (Benson et al., 2021; Schaub et al., 2021; Battiston et al., 2020; Lambiotte et al., 2019; Lee et al., 2021). For example, higher-order network motifs form the fundamental building blocks of many real-world networks (Mangan & Alon, 2003; Benson et al., 2016; Tsourakakis et al., 2017; Li et al., 2017; Li & Milenkovic, 2017). Session-based (multi-step) behaviors often indicate the preferences of web users in more precise ways (Xia et al., 2021; Wang et al., 2020; 2021; 2022). To capture these higher-order relations, hypergraphs provide a dedicated mathematical abstraction (Berge, 1984). However, learning algorithms on hypergraphs remain far less developed than those on graphs. Recently, inspired by the success of graph neural networks (GNNs), researchers have started investigating hypergraph neural network models (HNNs) (Feng et al., 2019; Yadati et al., 2019; Dong et al., 2020; Huang & Yang, 2021; Bai et al., 2021; Arya et al., 2020). Compared with GNNs, designing HNNs is more challenging. First, as mentioned above, higher-order relations modeled by hyperedges can contain complex information. Second, hyperedges in real-world hypergraphs are often of large and irregular sizes. Therefore, effectively representing higher-order relations while efficiently processing those irregular hyperedges is the key challenge when designing HNNs.
In this work, inspired by the recently developed hypergraph diffusion algorithms (Li et al., 2020a; Liu et al., 2021b; Fountoulakis et al., 2021; Takai et al., 2020; Tudisco et al., 2021a), we design a novel HNN architecture that is provably expressive enough to approximate a large class of hypergraph diffusion processes while keeping computational efficiency. Hypergraph diffusion is significant due to its transparency and has been widely applied to semi-supervised learning (Hein et al., 2013; Zhang et al., 2017a; Tudisco et al., 2021a), ranking aggregation (Li & Milenkovic, 2017; Chitra & Raphael, 2019), network analysis (Liu et al., 2021b; Fountoulakis et al., 2021; Takai et al., 2020), signal processing (Zhang et al., 2019; Schaub et al., 2021), and so on. However, traditional hypergraph diffusion needs to first handcraft potential functions to model higher-order relations and then use their gradients, or some variants thereof, as the diffusion operators that characterize the exchange of diffused quantities among the nodes within one hyperedge. The design of those potential functions often requires significant insight into the application, which may not be available in practice. We observe that the most commonly used hyperedge potential functions are permutation invariant, which covers applications where no node in a higher-order relation is treated as inherently special. For such potential functions, we further show that their induced diffusion operators must be permutation equivariant.
[Figure 1: Potential functions model higher-order relations. The gradients of those potentials determine the diffusion process and are termed diffusion operators. ED-HNN can universally represent such operators by feeding node representations into the messages from hyperedges to nodes. This small change makes a big difference in model performance.]
Inspired by this observation, we propose an NN-parameterized architecture that can provably represent any permutation-equivariant continuous hyperedge diffusion operator, with NN parameters learned in a data-driven way. We also introduce an efficient implementation based on current GNN platforms (Fey & Lenssen, 2019; Wang et al., 2019): we just need to combine a bipartite representation (or star expansion (Agarwal et al., 2006; Zien et al., 1999), equivalently) of hypergraphs with the standard message passing neural network (MPNN) (Gilmer et al., 2017). By stacking this architecture into layers with shared parameters, we obtain our model, named Equivariant Diffusion-based HNN (ED-HNN). Fig. 1 shows an illustration of hypergraph diffusion and the key architecture in ED-HNN. To the best of our knowledge, we are the first to establish the connection between the general class of hypergraph diffusion algorithms and the design of HNNs. Previous HNNs were either not expressive enough to represent equivariant diffusion operators (Feng et al., 2019; Yadati et al., 2019; Dong et al., 2020; Huang & Yang, 2021; Chien et al., 2022; Bai et al., 2021) or needed to learn the representations by adding a significant number of auxiliary nodes (Arya et al., 2020; Yadati, 2020; Yang et al., 2020). We provide a detailed discussion of them in Sec. 3.4. We also show that, due to its capability of representing equivariant diffusion operators, ED-HNN is by design good at predicting node labels over heterophilic hypergraphs, where hyperedges mix nodes from different classes. Moreover, ED-HNN can go very deep without much performance decay.
As an extra theoretical contribution, our proof of expressiveness avoids using equivariant polynomials as a bridge, which allows precise representations of continuous equivariant set functions by composing a continuous function with the sum of another continuous function over the set entries, while previous works (Zaheer et al., 2017; Segol & Lipman, 2020) only achieved an approximation result. This result may be of independent interest to the community. We evaluate ED-HNN by performing node classification over 9 real-world datasets that cover both heterophilic and homophilic hypergraphs. ED-HNN uniformly outperforms all baseline methods across these datasets and achieves significant improvement (>2% ↑) over 4 datasets therein. ED-HNN also remains robust when going deep. We also carefully design synthetic experiments to verify the expressiveness of ED-HNN in approximating pre-defined equivariant diffusion operators.

2. PRELIMINARIES: HYPERGRAPHS AND HYPERGRAPH DIFFUSION

Here, we formulate the hypergraph diffusion problem and, along the way, introduce the notation.

Definition 1 (Hypergraph). Let G = (V, E, X) be an attributed hypergraph where V and E are the node set and the hyperedge set, respectively. Each hyperedge e = {v_1^(e), ..., v_|e|^(e)} is a subset of V. Unlike graphs, a hyperedge may contain more than two nodes. X = [..., x_v, ...]^T ∈ R^N denotes the node attributes and x_v denotes the attribute of node v. Define d_v = |{e ∈ E : v ∈ e}| as the degree of node v. Let D denote the diagonal degree matrix over v ∈ V and D_e its sub-matrix restricted to v ∈ e. Here, we use 1-dim attributes for convenience of discussion, while our experiments often have multi-dim attributes.

Learning algorithms combine attributes and hypergraph structure into (latent) features defined as follows, which can be further used to make predictions for downstream tasks.

Definition 2 (Latent features). Let h_v ∈ R denote the (latent) features of node v ∈ V.

A widely-used heuristic to generate the features H is via hypergraph diffusion algorithms.

Definition 3 (Hypergraph Diffusion). Define node potential functions f(·; x_v) : R → R for v ∈ V and hyperedge potential functions g_e(·) : R^|e| → R for each e ∈ E. Hypergraph diffusion combines the node attributes and the hypergraph structure and asks to solve

    min_H  Σ_{v∈V} f(h_v; x_v) + Σ_{e∈E} g_e(H_e).    (1)

In practice, g_e is often shared across hyperedges of the same size, so later we omit the subscript e. The two potential functions are often designed via heuristics in the traditional hypergraph diffusion literature. Node potentials often correspond to negative-log kernels of the latent features and the attributes. For example, f(h_v; x_v) could be (h_v − x_v)^2 when computing hypergraph PageRank diffusion (Li et al., 2020a; Takai et al., 2020).
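The objective in Definition 3 can be made concrete on a toy hypergraph. The following sketch (our illustration, not the paper's code) uses the PageRank-style node potential f(h_v; x_v) = (h_v − x_v)^2 and a clique-expansion hyperedge potential; the particular hypergraph and attribute values are made up for the example.

```python
import numpy as np

# Toy attributed hypergraph (Definition 1): 4 nodes, 2 hyperedges
hyperedges = [[0, 1, 2], [1, 2, 3]]
x = np.array([0.0, 1.0, 1.0, 0.0])   # 1-dim node attributes, as in the text

def node_potential(h, x):
    # f(h_v; x_v) = (h_v - x_v)^2, summed over all nodes
    return np.sum((h - x) ** 2)

def ce_potential(h_e):
    # clique-expansion potential: sum over unordered node pairs of (h_v - h_u)^2
    return sum((h_e[i] - h_e[j]) ** 2
               for i in range(len(h_e)) for j in range(i + 1, len(h_e)))

def diffusion_objective(h):
    # Definition 3: node potential term plus one potential per hyperedge
    return node_potential(h, x) + sum(ce_potential(h[e]) for e in hyperedges)

print(diffusion_objective(x))   # objective at h = x: only hyperedge terms remain
```

Minimizing this objective over h trades off staying close to the attributes (node term) against smoothness within each hyperedge (hyperedge terms).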
Hyperedge potentials are more significant and complex, as they must model higher-order relations among more than two objects, which makes hypergraph diffusion very different from graph diffusion. Here we list a few examples.

Example 1 (Hyperedge potentials). Some practical g(H_e) may be chosen as follows.
• Clique Expansion (CE, hyperedges reduced to cliques) plus pairwise potentials (Zhou et al., 2007): Σ_{u,v∈e} (h_v − h_u)^2, or with degree normalization Σ_{u,v∈e} (h_v/√d_v − h_u/√d_u)^2 (≜ g(D_e^{−1/2} H_e)).
• Divergence to the mean (Tudisco et al., 2021a;b): Σ_{v∈e} (h_v − ‖|e|^{−1} H_e‖_p)^2, where ‖·‖_p computes the ℓ_p-norm.
• Total Variation (TV) (Hein et al., 2013; Zhang et al., 2017a): max_{u,v∈e} |h_v − h_u|^p, p ∈ {1, 2}.
• Lovász Extension (Lovász, 1983) for cardinality-based set functions (LEC) (Jegelka et al., 2013; Li et al., 2020a; Liu et al., 2021b): ⟨y, ς(H_e)⟩^p, p ∈ {1, 2}, where y = [..., y_j, ...]^T ∈ R^|e| is a constant vector and ς(H_e) sorts the values of H_e in decreasing order. One may reproduce TV by using LEC and setting y_1 = −y_|e| = 1 (remaining entries zero).

To reveal more properties of these hyperedge potentials g(·), we give the following definitions.

Definition 4 (Permutation Invariance & Equivariance). A function ψ : R^K → R is permutation-invariant if for any K-dim permutation matrix P ∈ Π[K], ψ(PZ) = ψ(Z) for all Z ∈ R^K. A function ψ : R^K → R^K is permutation-equivariant if for any K-dim permutation matrix P ∈ Π[K], ψ(PZ) = Pψ(Z) for all Z ∈ R^K.

One may easily verify the permutation invariance of the hyperedge potentials in Example 1. The underlying physical meaning is that the prediction goal of an application is independent of node identities within a hyperedge, so practical g's are invariant w.r.t. the node ordering (Veldt et al., 2021). Early approaches often reduced hyperedges to graphs via such expansions and further used traditional graph methods.
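The potentials in Example 1 are each a few lines of code. A minimal numpy sketch (our illustration; the specific hyperedge values are made up) also checks the stated fact that LEC with y_1 = −y_|e| = 1 reproduces TV:

```python
import numpy as np

def ce(h_e):
    # clique expansion: sum over unordered pairs of (h_v - h_u)^2
    d = h_e[:, None] - h_e[None, :]
    return 0.5 * np.sum(d ** 2)        # halve the full matrix to count each pair once

def tv(h_e, p=2):
    # total variation: max_{u,v in e} |h_v - h_u|^p = (max - min)^p
    return (h_e.max() - h_e.min()) ** p

def lec(h_e, y, p=2):
    # Lovasz extension for cardinality-based cuts: <y, sort_desc(H_e)>^p
    return np.dot(y, np.sort(h_e)[::-1]) ** p

h_e = np.array([0.7, 0.5, 0.3])
y = np.array([1.0, 0.0, -1.0])         # y_1 = 1, y_|e| = -1, zeros elsewhere
print(ce(h_e), tv(h_e), lec(h_e, y))   # LEC with this y equals TV
```

All three functions return the same value under any permutation of h_e, which is the invariance Definition 4 formalizes.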
Later, researchers proved that those hyperedge reduction techniques cannot well represent higher-order relations (Li & Milenkovic, 2017; Chien et al., 2019). Therefore, Lovász extensions of set-based cut-cost functions on hyperedges have been proposed recently and used as the potential functions (Hein et al., 2013; Li & Milenkovic, 2018; Li et al., 2020a; Takai et al., 2020; Fountoulakis et al., 2021; Yoshida, 2019). However, designing those set-based cut costs is practically hard and needs a lot of trial and error. Other types of handcrafted hyperedge potentials to model information propagation can also be found in (Neuhäuser et al., 2022; 2021), which again are handcrafted and heavily based on heuristics and evaluation performance. Our idea is to use data-driven approaches to model such potentials, which naturally brings us to HNNs. On one hand, we expect to leverage the expressive power of NNs to learn the desired hypergraph diffusion automatically from the data. On the other hand, we are interested in deriving novel hypergraph NN (HNN) architectures inspired by traditional hypergraph diffusion solvers. To achieve these goals, next, we show that solving objective Eq. 1 by the gradient descent algorithm (GD) or the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) amounts to iteratively applying certain hyperedge diffusion operators. Parameterizing such operators using NNs at each step unfolds hypergraph diffusion into an HNN. The roadmap is as follows. In Sec. 3.1, we make the key observation that the diffusion operators are inherently permutation-equivariant. To universally represent them, in Sec. 3.2, we propose an equivariant NN-based operator and use it to build ED-HNN via an efficient implementation. In Sec. 3.3, we discuss the benefits of learning equivariant diffusion operators. In Sec. 3.4, we review previous HNNs and discuss why none of their architectures allow efficient modeling of such equivariant diffusion operators.

3.1. EMERGING EQUIVARIANCE IN HYPERGRAPH DIFFUSION

We start by discussing traditional solvers for Eq. 1. If f and g are both differentiable, one straightforward optimization approach is gradient descent. The node-wise update of each iteration can be formulated as:

    h_v^(t+1) ← h_v^(t) − η (∇f(h_v^(t); x_v) + Σ_{e:v∈e} [∇g(H_e^(t))]_v), for v ∈ V,    (2)

where [∇g(H_e)]_v denotes the gradient w.r.t. h_v for v ∈ e. We use the superscript t to denote the current iteration, h_v^(0) = x_v gives the initial features, and η is known as the step size. For general f and g, we may adopt ADMM: for each e ∈ E, we introduce an auxiliary variable Q_e = [q_{e,1}, ..., q_{e,|e|}]^T ∈ R^|e|. We initialize h_v^(0) = x_v and Q_e^(0) = H_e, and then iterate

    Q_e^(t+1) ← prox_{ηg}(2H_e^(t) − Q_e^(t)) − H_e^(t) + Q_e^(t), for e ∈ E,    (3)
    h_v^(t+1) ← prox_{ηf(·;x_v)/d_v}(Σ_{e:v∈e} q_{e,v}^(t+1) / d_v), for v ∈ V,    (4)

where prox_ψ(h) ≜ argmin_z ψ(z) + (1/2)‖z − h‖_2^2 is the proximal operator. The detailed derivation can be found in Appendix A. The iterations have convergence guarantees under closed convex assumptions on f and g (Boyd et al., 2011). However, our model does not rely on convergence, as it simply runs the iterations for a given number of steps (i.e., the number of layers in ED-HNN). The operator prox_ψ(·) has nice properties, reviewed in Proposition 1, which enable NN-based approximation even when f and g are not differentiable. One noted non-differentiable example is the LEC case with g(H_e) = ⟨y, ς(H_e)⟩^p in Example 1.

Proposition 1 (Parikh & Boyd (2014); Polson et al. (2015)). If ψ(·) : R^K → R is a lower semi-continuous convex function, then prox_ψ(·) is 1-Lipschitz continuous.

The node-side operations, the gradient ∇f(·; x_v) and the proximal operator prox_{ηf(·;x_v)}(·), are relatively easy to model, while the operations on hyperedges are more complicated.
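Both ingredients of these solvers have simple closed forms for the common choices above. The sketch below (our illustration, on a made-up toy hypergraph) implements one gradient-descent step of Eq. 2 with the CE potential, and the proximal operator of the quadratic node potential f(h; x_v) = (h − x_v)^2, which can be solved in closed form:

```python
import numpy as np

hyperedges = [[0, 1, 2], [1, 2, 3]]
x = np.array([0.0, 1.0, 1.0, 0.0])   # initial attributes, h^(0) = x
eta = 0.1                            # step size

def grad_ce(h_e):
    # gradient of the CE potential sum_{u<v} (h_v - h_u)^2 w.r.t. each h_v:
    # [grad g(H_e)]_v = 2 * (|e| * h_v - sum(H_e))
    return 2.0 * (len(h_e) * h_e - h_e.sum())

def gd_step(h):
    # Eq. 2: h_v <- h_v - eta * (grad f(h_v; x_v) + sum_{e: v in e} [grad g(H_e)]_v)
    g = 2.0 * (h - x)                # gradient of f(h; x) = (h - x)^2
    for e in hyperedges:
        g[e] += grad_ce(h[e])
    return h - eta * g

def prox_quadratic(h, x_v, eta=1.0):
    # prox_{eta * f(.; x_v)}(h) for f(h; x_v) = (h - x_v)^2 has the closed form
    # argmin_z eta*(z - x_v)^2 + 0.5*(z - h)^2 = (h + 2*eta*x_v) / (1 + 2*eta)
    return (h + 2 * eta * x_v) / (1 + 2 * eta)
```

For non-differentiable potentials such as LEC, one would replace grad_ce by the proximal operator of g, as Eq. 3 does.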
We name the gradient ∇g(·) : R^|e| → R^|e| and the proximal operator prox_{ηg}(·) : R^|e| → R^|e| hyperedge diffusion operators, since they summarize the collection of node features inside a hyperedge and dispatch the aggregated information back to the interior nodes individually. Next, we reveal one crucial property of these hyperedge diffusion operators in the following proposition (see the proof in Appendix B.1):

Proposition 2. Given any permutation-invariant hyperedge potential function g(·), the hyperedge diffusion operators prox_{ηg}(·) and ∇g(·) are permutation equivariant.

Algorithm 1: ED-HNN
Initialization: H^(0) = X and three MLPs φ, ρ, ψ (shared across L layers). For t = 0, 1, 2, ..., L−1, do:
1. Compute the messages from V to E: m_{u→e}^(t) = φ(h_u^(t)), for all u ∈ V and e ∋ u.
2. Sum the V → E messages over each hyperedge: m_e^(t) = Σ_{u∈e} m_{u→e}^(t), for all e ∈ E.
3. Broadcast m_e^(t) and compute the messages from E to V: m_{e→v}^(t) = ρ(h_v^(t), m_e^(t)), for all v ∈ e.
4. Update h_v^(t+1) = ψ(h_v^(t), Σ_{e:v∈e} m_{e→v}^(t), x_v, d_v), for all v ∈ V.

Proposition 2 states that a permutation-invariant hyperedge potential leads to an operator that processes different nodes in a permutation-equivariant way.
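The four steps of Algorithm 1 can be sketched directly in numpy. The stand-ins below for the three MLPs are single random-weight linear maps with ReLU, chosen purely for brevity; the hypergraph and dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 4, 8
hyperedges = [[0, 1, 2], [1, 2, 3]]
x = rng.normal(size=(n_nodes, dim))            # initial attributes H^(0) = X
deg = np.array([sum(v in e for e in hyperedges) for v in range(n_nodes)], float)

# stand-ins for the three MLPs (phi, rho, psi), shared across layers
W_phi = 0.1 * rng.normal(size=(dim, dim))
W_rho = 0.1 * rng.normal(size=(2 * dim, dim))
W_psi = 0.1 * rng.normal(size=(3 * dim + 1, dim))
relu = lambda z: np.maximum(z, 0.0)

def ed_hnn_layer(h):
    # Step 1: node-to-hyperedge messages m_{u->e} = phi(h_u)
    msg = relu(h @ W_phi)
    # Step 2: sum messages within each hyperedge: m_e = sum_{u in e} m_{u->e}
    m_e = [msg[e].sum(axis=0) for e in hyperedges]
    # Step 3: broadcast m_e; m_{e->v} = rho(h_v, m_e) also takes the receiver's
    # own features, which is what makes the hyperedge operator equivariant
    agg = np.zeros_like(h)
    for e, me in zip(hyperedges, m_e):
        for v in e:
            agg[v] += relu(np.concatenate([h[v], me]) @ W_rho)
    # Step 4: h_v <- psi(h_v, aggregated E->V messages, x_v, d_v)
    inp = np.concatenate([h, agg, x, deg[:, None]], axis=1)
    return relu(inp @ W_psi)

h1 = ed_hnn_layer(x)
```

Because the layer reuses the same weights at every iteration, stacking L calls to ed_hnn_layer mirrors the parameter sharing across the L layers of ED-HNN.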

3.2. BUILDING EQUIVARIANT HYPEREDGE DIFFUSION OPERATORS

Our design of permutation-equivariant diffusion operators is built upon the following Theorem 1. We leave the proof to Appendix B.2.

Theorem 1. ψ(·) : [0,1]^K → R^K is a continuous permutation-equivariant function if and only if it can be represented as [ψ(Z)]_i = ρ(z_i, Σ_{j=1}^K φ(z_j)), i ∈ [K], for any Z = [..., z_i, ...]^T ∈ [0,1]^K, where ρ : R^{K′} → R and φ : R → R^{K′−1} are two continuous functions, and K′ ≥ K.

Remark on an extra theoretical contribution: Theorem 1 indicates that any continuous permutation-equivariant operator whose input entries z_i have 1-dim feature channels can be precisely written as a composition of a continuous function ρ with the sum of another continuous function φ over the input entries. This result generalizes the representation of permutation-invariant functions as ρ(Σ_{i=1}^K φ(z_i)) in (Zaheer et al., 2017) to the equivariant case. An architecture in a similar spirit was proposed in (Segol & Lipman, 2020). However, their proof only allows approximation, i.e., a small ‖ψ̂(Z) − ψ(Z)‖ between the learned ψ̂ and the target ψ, instead of the precise representation in Theorem 1. Also, Zaheer et al. (2017); Segol & Lipman (2020); Sannai et al. (2019) focus on representations of a single set instead of a hypergraph with coupled sets.

The above theoretical observation inspires our design of equivariant hyperedge diffusion operators. Specifically, for an operator ψ(·) : R^|e| → R^|e| that may denote either the gradient ∇g(·) or the proximal operator prox_{ηg}(·) for each hyperedge e, we parameterize it as

    [ψ̂(H_e)]_v = ρ(h_v, Σ_{u∈e} φ(h_u)), for v ∈ e, where ρ, φ are multi-layer perceptrons (MLPs).    (5)

Intuitively, the inner sum collects the φ-encoded node features within a hyperedge, and then ρ combines this collection with each node's own features to perform a separate operation per node. The implementation of the above ψ̂ is not trivial.
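Before turning to implementation, the form in Theorem 1 can be sanity-checked numerically: any choice of continuous ρ and φ plugged into [ψ(Z)]_i = ρ(z_i, Σ_j φ(z_j)) yields a permutation-equivariant map. The particular ρ and φ below are arbitrary toy choices for illustration:

```python
import numpy as np

def phi(z):
    # any continuous map R -> R^{K'-1}; a fixed feature map as a toy choice
    return np.array([z, z ** 2, np.sin(z)])

def rho(z_i, s):
    # any continuous map combining the entry z_i with the pooled summary s
    return z_i * s[0] - 0.3 * s[1] + np.cos(z_i + s[2])

def psi(Z):
    # [psi(Z)]_i = rho(z_i, sum_j phi(z_j)), the form in Theorem 1
    s = sum(phi(z) for z in Z)       # the sum is permutation-invariant
    return np.array([rho(z, s) for z in Z])

Z = np.array([0.2, 0.9, 0.5, 0.1])
perm = np.array([2, 0, 3, 1])
# permuting the input permutes the output identically: psi(PZ) = P psi(Z)
print(np.allclose(psi(Z[perm]), psi(Z)[perm]))
```

Equivariance holds because the pooled summary s is unchanged by reordering, while each output entry depends on its own z_i.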
A naive implementation is to generate an auxiliary node to represent each (v, e)-pair for v ∈ V and e ∈ E and learn its representation, as adopted in (Yadati, 2020; Arya et al., 2020; Yang et al., 2020). However, this may substantially increase the model complexity. Our implementation is built upon the bipartite representation (or star expansion (Zien et al., 1999; Agarwal et al., 2006), equivalently) of hypergraphs paired with the standard message passing NN (MPNN) (Gilmer et al., 2017), which can be efficiently implemented via GNN platforms (Fey & Lenssen, 2019; Wang et al., 2019) or sparse-matrix multiplication. Specifically, we build a bipartite graph Ḡ = (V̄, Ē). The node set V̄ contains two parts V ∪ V_E, where V is the original node set and V_E contains one node for each original hyperedge e ∈ E. Then, we add an edge between v ∈ V and the node for e ∈ E if v ∈ e. With this bipartite graph representation, ED-HNN is implemented by following Algorithm 1. The equivariant diffusion operator ψ̂ is constructed via steps 1-3. The last step updates the node features to accomplish the first two terms in Eq. 2 or the ADMM update in Eq. 4. We leave a more detailed discussion of how Algorithm 1 aligns with the GD and ADMM updates to Appendix C. The initial attributes x_v and node degrees are included to match the diffusion algorithm by design. As the diffusion operators are shared across iterations, ED-HNN shares parameters across layers. We now summarize the joint contribution of our theory and efficient implementation as follows.

Proposition 3. MPNNs on bipartite representations (or star expansions) of hypergraphs are expressive enough to learn any continuous diffusion operators induced by invariant hyperedge potentials.

Simple but Significant Architecture Difference.
Some previous works (Dong et al., 2020; Huang & Yang, 2021; Chien et al., 2022) also apply GNNs over bipartite representations of hypergraphs to build their HNN models, which look similar to ED-HNN. However, ED-HNN has a simple but significant architectural difference: step 3 in Algorithm 1 adopts h_v^(t) as one input when computing the message m_{e→v}^(t). This is the crucial step that guarantees the hyperedge operation formed by combining steps 1-3 is an equivariant operator from {h_v^(t)}_{v∈e} to {m_{e→v}^(t)}_{v∈e}. Previous models often by default adopt m_{e→v}^(t) = m_e^(t), which yields an invariant operator. Such a simple change is significant, as it guarantees universal approximation of equivariant operators, which leads to many benefits discussed in Sec. 3.3. Our experiments will verify these benefits.

Extension. ED-HNN can be naturally extended to build equivariant node operators because of the duality between hyperedges and nodes, though this is not necessary in the traditional hypergraph diffusion problem of Eq. 1. Specifically, we call this model ED-HNNII, which simply revises step 1 in ED-HNN to m_{u→e}^(t) = φ(h_u^(t), m_e^(t−1)) for any e such that u ∈ e. Due to the page limit, we put some experimental results on ED-HNNII in Appendix F.1.
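The invariant-versus-equivariant distinction in step 3 is easy to see on a single hyperedge. In the toy sketch below (our illustration, with made-up values), the invariant scheme sends every member of e the same message, while a ρ that also reads the receiver's features can, for instance, reproduce the clique-expansion gradient from Example 1, something no invariant message can express:

```python
import numpy as np

h_e = np.array([0.7, 0.5, 0.3])     # features of the three nodes in edge e
m_e = h_e.sum()                      # pooled hyperedge summary (steps 1-2)

# invariant scheme (previous models): m_{e->v} = m_e for every v in e
invariant_msgs = np.full_like(h_e, m_e)

# equivariant scheme (ED-HNN, step 3): m_{e->v} = rho(h_v, m_e); this toy rho
# recovers the CE gradient 2*(|e|*h_v - sum(H_e)) exactly
rho = lambda h_v, m: 2.0 * (3.0 * h_v - m)
equivariant_msgs = rho(h_e, m_e)
print(invariant_msgs, equivariant_msgs)
```

The invariant messages are identical across the edge, so all member nodes would receive the same update; the equivariant messages differ per node while still respecting node-order symmetry.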

3.3. ADVANTAGES OF ED-HNN IN HETEROPHILIC SETTINGS AND DEEP MODELS

Here, we discuss several advantages of ED-HNN that stem from its design of equivariant diffusion operators. Heterophily describes the network phenomenon where nodes with the same labels and attributes are less likely to connect to each other directly (Rogers, 2010). Predicting node labels in heterophilic networks is known to be more challenging than in homophilic networks, and has recently become an important research direction (Pei et al., 2020; Zhu et al., 2020; Chien et al., 2021; Lim et al., 2021). Heterophily has been shown to be a more common phenomenon in hypergraphs than in graphs, since it is hard to expect all nodes in a giant hyperedge to share a common label (Veldt et al., 2022). Moreover, predicting node labels in heterophilic hypergraphs is more challenging than in graphs, as a hyperedge may consist of nodes from multiple categories.

Learnable equivariant diffusion operators are expected to be superior in predicting heterophilic node labels. For example, suppose a hyperedge e of size 3 is known to cover two nodes v, u from class C_1 and one node w from class C_0. We may use the LEC potential ⟨y, ς(H_e)⟩^2 by setting the parameter y = [1, −1, 0]^T. Suppose the three nodes' attributes are H_e = [h_v, h_u, h_w]^T = [0.7, 0.5, 0.3]^T, where h_v and h_u are close as they come from the same class. One may check that the hyperedge diffusion operator gives ∇g(H_e) = [0.4, −0.4, 0]^T. One step of gradient descent h_v − η[∇g(H_e)]_v with a proper step size (η < 0.5) drags h_v and h_u closer while keeping h_w unchanged. In contrast, invariant diffusion, by forcing the operation on hyperedge e to satisfy [∇g(H_e)]_v = [∇g(H_e)]_u = [∇g(H_e)]_w (invariant messages from e to the nodes in e), allocates every node the same change and cannot deal with the heterogeneity of node labels in e. Moreover, a learnable operator is important. To see this, suppose we have different ground-truth labels, say w, u from the same class and v from the other.
Then, using the above parameter y may increase the error. However, learnable operators can address this problem, e.g., by obtaining a more suitable parameter y = [0, 1, −1]^T via training. Moreover, equivariant hyperedge diffusion operators are also good at building deep models. GNNs tend to degrade in performance when going deep, which is often attributed to oversmoothing (Li et al., 2018; Oono & Suzuki, 2019) and overfitting (Cong et al., 2021). Equivariant operators that allocate different messages across nodes help to overcome the oversmoothing issue. Moreover, diffusion with parameters shared across layers may reduce the risk of overfitting.
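The worked LEC example above can be verified numerically. The sketch below (our illustration) computes the gradient of g(H_e) = ⟨y, ς(H_e)⟩^2 for inputs with distinct entries, where each y_j is routed back to the node holding the j-th largest value, and then takes one gradient step:

```python
import numpy as np

def lec_grad(h_e, y):
    # gradient of g(H_e) = <y, sort_desc(H_e)>^2 when all entries are distinct:
    # 2 * <y, sort_desc(H_e)> * y_j flows to the node ranked j-th largest
    order = np.argsort(-h_e)            # node indices by decreasing feature value
    inner = np.dot(y, h_e[order])
    grad = np.zeros_like(h_e)
    grad[order] = 2.0 * inner * y
    return grad

h_e = np.array([0.7, 0.5, 0.3])         # h_v, h_u, h_w from the text's example
y = np.array([1.0, -1.0, 0.0])
g = lec_grad(h_e, y)                    # expected [0.4, -0.4, 0.0]
eta = 0.2                               # a proper step size (eta < 0.5)
h_new = h_e - eta * g                   # v and u move closer; w is unchanged
print(g, h_new)
```

With η = 0.2, the gap between h_v and h_u shrinks from 0.2 to 0.04 while h_w stays at 0.3, matching the text's claim.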

3.4. RELATED WORKS: PREVIOUS HNNS FOR REPRESENTING HYPERGRAPH DIFFUSION

ED-HNN is the first HNN inspired by the general class of hypergraph diffusion and can provably achieve universal approximation of hyperedge diffusion operators. How well, then, did previous HNNs represent hypergraph diffusion? We temporarily ignore the big difference in whether parameters are shared across layers and give some analysis as follows. HGNN (Feng et al., 2019) runs graph convolution (Kipf & Welling, 2017) on clique expansions of hypergraphs, which directly assumes the hyperedge potentials follow CE plus pairwise potentials and cannot learn other operations on hyperedges. HyperGCN (Yadati et al., 2019) essentially leverages the total variation potential by adding mediator nodes to each hyperedge, adopting a transformation technique for the total variation potential from (Chan et al., 2018; Chan & Liang, 2020). GHSC (Zhang et al., 2022) is designed to represent a specific edge-dependent-node-weight hypergraph diffusion proposed in (Chitra & Raphael, 2019). A few works view each hyperedge as a multiset of nodes and each node as a multiset of hyperedges (Dong et al., 2020; Huang & Yang, 2021; Chien et al., 2022; Bai et al., 2021; Arya et al., 2020; Yadati, 2020; Yang et al., 2020; Jo et al., 2021). Among them, HNHN (Dong et al., 2020), UniGNNs (Huang & Yang, 2021), EHGNN (Jo et al., 2021) and AllSet (Chien et al., 2022) build two invariant set-pooling functions on the hyperedge and node sides, which cannot represent equivariant functions. HCHA (Bai et al., 2021) and UniGAT (Yang et al., 2020) compute attention weights by combining node features and hyperedge messages, which looks like our ρ operation in step 3. However, a scalar attention weight is too limited to represent the potentially complicated ρ. HyperSAGE (Arya et al., 2020), MPNN-R (Yadati, 2020) and LEGCN (Yang et al., 2020) may learn hyperedge-equivariant operators in principle by adding auxiliary nodes to represent node-hyperedge pairs, as aforementioned.
However, because of the extra complexity, these models are either too slow or need to reduce their parameter sizes to fit in memory, which constrains their actual performance in practice. Of course, none of these works provide theoretical arguments on universal representation of equivariant hypergraph diffusion operators as ours does. One subtle point is that Uni{GIN,GraphSAGE,GCNII} (Huang & Yang, 2021) adopt a jump link in each layer to pass node features from the former layer directly to the next, and may hope to use a complicated node-side invariant function ψ′(h_v^(t), Σ_{e:v∈e} m_e^(t)) to directly approximate our ψ(h_v^(t), Σ_{e:v∈e} m_{e→v}^(t)) = ψ(h_v^(t), Σ_{e:v∈e} ρ(h_v^(t), m_e^(t))) in step 4 of Algorithm 1. This may be doable in theory; however, the jump-link solution may require a higher dimension of m_e^(t).

4. EXPERIMENTS

4.1. RESULTS ON REAL-WORLD HYPERGRAPH DATASETS

Experiment Setting. We evaluate ED-HNN on nine real-world hypergraph benchmarks, including citation networks, Walmart, House (Chodrow et al., 2021), and Congress and Senate (Fowler, 2006b;a). More details of these datasets can be found in Appendix F.2. Since the last four hypergraphs do not contain node attributes, we follow the method of Chien et al. (2022) to generate node features from label-dependent Gaussian distributions (Deshpande et al., 2018). As we show in Table 1, these datasets already cover sufficiently diverse hypergraphs in terms of scale, structure, and homo-/heterophily. We compare our method with top-performing models on these benchmarks, including HGNN (Feng et al., 2019), HCHA (Bai et al., 2021), HNHN (Dong et al., 2020), HyperGCN (Yadati et al., 2019), UniGCNII (Huang & Yang, 2021), AllDeepSets (Chien et al., 2022), AllSetTransformer (Chien et al., 2022), and a recent diffusion method HyperND (Prokopchik et al., 2022). All the hyperparameters for baselines follow (Chien et al., 2022), and we fix the learning rate, weight decay and other training recipes to be the same as the baselines. Other model-specific hyperparameters are obtained via grid search (see Appendix F.3).
We randomly split the data into training/validation/test samples using a 50%/25%/25% split, following Chien et al. (2022). We choose prediction accuracy as the evaluation metric. We run each model ten times with different training/validation splits to obtain the standard deviation. In Appendix F.7, we also analyze the model's sensitivity to different hidden dimensions.

Performance Analysis. Table 2 shows the results. Our ED-HNN uniformly outperforms all the compared models on all the datasets. We observe that the top-performing baselines are AllSetTransformer, AllDeepSets and UniGCNII. As analyzed in Sec. 3.4, they model invariant set functions on both the node and hyperedge sides. UniGCNII also adds initial and jump links, which incidentally resonates with our design principle (step 4 in Algorithm 1). However, their performance varies widely across datasets. For example, UniGCNII attains promising performance on citation networks but subpar results on the Walmart dataset. In contrast, our model achieves stably superior results, surpassing AllSet models by 12.9% on Senate and UniGCNII by 12.5% on Walmart. We attribute this empirical significance to the theoretical design of exact equivariant function representation. Compared with HyperND, the SOTA hypergraph diffusion algorithm, ED-HNN achieves superior performance on every dataset. HyperND provably converges with respect to the "divergence to the mean" hyperedge potential (see Example 1), which gives it good performance on homophilic networks but only subpar performance on heterophilic datasets. Regarding computational efficiency, we report the wall-clock times for training (for 100 epochs) and testing of all models on the largest hypergraph, Walmart, where all models use the same hyperparameters that achieve the reported performance.
Our model achieves efficiency comparable to AllSet models (Chien et al., 2022) and is much faster than UniGCNII. So, the implementation of equivariant computation in ED-HNN is still efficient.

4.2. RESULTS ON SYNTHETIC HETEROPHILIC HYPERGRAPH DATASET

Experiment Setting. As discussed, ED-HNN is expected to perform well on heterophilic hypergraphs. We evaluate this using synthetic datasets with controlled heterophily. We generate data using a contextual hypergraph stochastic block model (Deshpande et al., 2018; Ghoshdastidar & Dukkipati, 2014; Lin & Wang, 2018). Specifically, we draw two classes of 2,500 nodes each and then randomly sample 1,000 hyperedges. Each hyperedge consists of 15 nodes, among which α_i are sampled from class i. We use α = min{α_1, α_2} to denote the heterophily level. Afterwards, we generate label-dependent Gaussian node features with standard deviation 1.0. We test both homophilic (α = 1, 2, or CE homophily ≥ 0.7) and heterophilic (α = 4-7, or CE homophily ≤ 0.7) cases. We compare ED-HNN with HGNN, AllSet models and their variants with jump links. We follow the previous 50%/25%/25% data split and repeat each experiment 10 times.

Results. Table 3 shows the results. On homophilic datasets, all the models achieve good results, with ED-HNN slightly better than the others. Once α surpasses 3, i.e., entering the heterophilic regime, the superiority of ED-HNN becomes more obvious. The jump-link trick indeed also helps, while building in equivariance as ED-HNN does directly provides a more significant improvement.
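The sampling procedure above can be sketched at a reduced scale. The sketch (our illustration; the paper-scale setting uses 2,500 nodes per class and 1,000 hyperedges) fixes α_1 = 4, so each hyperedge mixes 4 class-1 nodes with 11 class-0 nodes, giving heterophily level α = 4:

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n_edges, edge_size, alpha1 = 250, 100, 15, 4   # scaled-down toy

class0 = np.arange(n_per_class)
class1 = np.arange(n_per_class, 2 * n_per_class)
labels = np.repeat([0, 1], n_per_class)

def sample_hyperedge():
    # alpha1 nodes from class 1, the remaining edge_size - alpha1 from class 0
    part1 = rng.choice(class1, size=alpha1, replace=False)
    part0 = rng.choice(class0, size=edge_size - alpha1, replace=False)
    return np.concatenate([part0, part1])

hyperedges = [sample_hyperedge() for _ in range(n_edges)]
# label-dependent Gaussian features: class means -1 and +1, std 1.0
features = rng.normal(loc=2.0 * labels - 1.0, scale=1.0)
```

Sweeping alpha1 from 1 to 7 reproduces the homophilic-to-heterophilic range tested in Table 3.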

4.3. BENEFITS IN DEEPENING HYPERGRAPH NEURAL NETWORKS

We also demonstrate that, by using diffusion models and parameter tying, ED-HNN can benefit from deeper architectures while other HNNs cannot. Fig. 2 illustrates the performance of different models versus the number of network layers. We compare with HGNN, AllSet models, and UniGCNII. UniGCNII inherits from (Chen et al., 2020), which is known to be effective at counteracting oversmoothing. The results reveal that AllSet models suffer when going deep. HGNN, working through a more lightweight mechanism, has better tolerance to depth. However, none of them benefit from deepening. On the contrary, ED-HNN successfully leverages deeper architectures to achieve higher accuracy. For example, adding more layers boosts ED-HNN by ~1% in accuracy on Pubmed and House, and elevates ED-HNN from 58.01% to 64.79% on the Senate dataset.

4.4. EXPRESSIVENESS JUSTIFICATION ON THE SYNTHETIC DIFFUSION DATASET

We now evaluate the ability of ED-HNN to express a given hypergraph diffusion. We generate semi-synthetic diffusion data using the Senate hypergraph (Chodrow et al., 2021) and synthetic node features. The data consists of 1,000 pairs (H^{(0)}, H^{(1)}). The initial node features H^{(0)} are sampled from 1-dim Gaussian distributions. To obtain H^{(1)}, we apply the gradient step in Eq. 2. For non-differentiable cases, we adopt subgradients for convenient computation. We fix the node potential as f(h_v; x_v) = (h_v − x_v)^2 and consider 3 different edge potentials in Example 1 with varying complexities: a) CE, b) TV (p = 2), and c) LEC. The goal is to let one-layer models V → E → V recover H^{(1)}. We compare ED-HNN with our implemented baseline (Invariant), which uses parameterized invariant set functions on both the node and hyperedge sides, and the AllSet models (Chien et al., 2022) that also adopt invariant set functions. We keep the scale of all models almost the same for a fair comparison; more details are given in Appendix F.5. The results are reported in Fig. 3. The CE case gives an invariant diffusion, so all models can learn it well. The TV case most clearly shows the benefit of the equivariant architecture: the error almost does not increase even when the hidden dimension decreases to 32. The LEC case is challenging for all models, though ED-HNN is still the best. We think the reason is that learning the sorting operation in LEC via the sum pooling in Eq. 5 is empirically challenging, albeit theoretically doable. A similar phenomenon has been observed in prior literature (Murphy et al., 2018; Wagstaff et al., 2019).
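As a concrete illustration, one diffusion pair (H^{(0)}, H^{(1)}) is generated by a single (sub)gradient step per hyperedge. The exact CE and TV formulas below (clique-expansion quadratic and squared max pairwise gap) are our own reading of the potentials, included only as a sketch:

```python
import numpy as np

# (Sub)gradients of two illustrative edge potentials on one hyperedge, used to
# generate one diffusion step H^(1) = H^(0) - eta * grad (Eq. 2).
def ce_grad(h):
    # CE read as the clique-expansion quadratic g(H_e) = sum_{u<v} (h_u - h_v)^2
    return 2.0 * (len(h) * h - h.sum())

def tv_subgrad(h):
    # TV (p=2) read as g(H_e) = (max h - min h)^2; one valid subgradient
    g = np.zeros_like(h)
    i, j = np.argmax(h), np.argmin(h)
    gap = h[i] - h[j]
    g[i], g[j] = 2.0 * gap, -2.0 * gap
    return g

h0 = np.array([0.0, 1.0, 3.0])   # node features on a toy hyperedge e = {0, 1, 2}
eta = 0.1
# At t = 0 we have h = x, so the node-potential gradient 2(h - x) vanishes.
h1_ce = h0 - eta * ce_grad(h0)
h1_tv = h0 - eta * tv_subgrad(h0)
```

Note how the CE step pulls every node toward the hyperedge mean, while the TV subgradient only moves the extreme nodes.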

5. CONCLUSION

This work introduces a new hypergraph neural network, ED-HNN, that can model hypergraph diffusion processes. We show that any hypergraph diffusion with permutation-invariant potential functions can be represented by iterating equivariant diffusion operators. ED-HNN provides an efficient way to model such operators based on commonly-used GNN platforms. ED-HNN shows superiority in processing heterophilic hypergraphs and constructing deep models. For future work, ED-HNN can be applied to other tasks such as regression, or extended to counterpart implicit models.

A DERIVATION OF EQ. 3 AND 4

We derive the iterative updates Eq. 3 and 4 as follows. Recall that our problem is defined as

min_H Σ_{v∈V} f(h_v; x_v) + Σ_{e∈E} g_e(H_e),

where H ∈ R^N is the node feature vector. H_e = [h_{e,1}, ..., h_{e,|e|}]^⊤ ∈ R^{|e|} collects the associated node features inside edge e, where h_{e,i} corresponds to the node feature of the i-th node in edge e. We use x_v to denote the (initial) attribute of node v. Here, neither f(·; x_v) nor g_e(·) is necessarily continuous. To solve this problem, we borrow the idea from ADMM (Boyd et al., 2011). We introduce an auxiliary variable R_e = H_e for every e ∈ E and reformulate the problem as a constrained optimization problem:

min_{H, {R_e}} Σ_{v∈V} f(h_v; x_v) + Σ_{e∈E} g_e(R_e)   subject to   R_e − H_e = 0, ∀e ∈ E.

By the Augmented Lagrangian Method (ALM), we assign a Lagrangian multiplier S_e (scaled by 1/λ) to each edge; the objective function becomes

max_{{S_e}_{e∈E}} min_{H, {R_e}_{e∈E}} Σ_{v∈V} f(h_v; x_v) + Σ_{e∈E} g(R_e) + (λ/2) Σ_{e∈E} ‖R_e − H_e + S_e‖_F^2 − (λ/2) Σ_{e∈E} ‖S_e‖_F^2.   (6)

We can iterate the following primal-dual steps to optimize Eq. 6. The primal step can be computed block-wise:

R_e^{(t+1)} ← argmin_{R_e} (λ/2) ‖R_e − H_e^{(t)} + S_e^{(t)}‖_F^2 + g(R_e) = prox_{g/λ}(H_e^{(t)} − S_e^{(t)}), ∀e ∈ E,   (7)

h_v^{(t+1)} ← argmin_{h_v} (λ/2) Σ_{e:v∈e} ‖h_v − S_{e,v}^{(t)} − R_{e,v}^{(t+1)}‖_F^2 + f(h_v; x_v) = prox_{f(·;x_v)/(λ d_v)}( (1/d_v) Σ_{e:v∈e} (S_{e,v}^{(t)} + R_{e,v}^{(t+1)}) ), ∀v ∈ V.   (8)
The dual step can be computed as

S_e^{(t+1)} ← S_e^{(t)} + R_e^{(t+1)} − H_e^{(t+1)}, ∀e ∈ E.   (9)

Denote Q_e^{(t+1)} := S_e^{(t)} + R_e^{(t+1)} and η := 1/λ; then the iterative updates become

Q_e^{(t+1)} = prox_{ηg}(2H_e^{(t)} − Q_e^{(t)}) + Q_e^{(t)} − H_e^{(t)}, for e ∈ E,   (10)

h_v^{(t+1)} = prox_{ηf(·;x_v)/d_v}( (1/d_v) Σ_{e:v∈e} Q_{e,v}^{(t+1)} ), for v ∈ V.   (11)

B DEFERRED PROOFS

B.1 PROOF OF PROPOSITION 2

Proof. Define π : [K] → [K] to be the index mapping associated with the permutation matrix P ∈ Π(K) such that PZ = [z_{π(1)}, ..., z_{π(K)}]^⊤. To prove that ∇g(·) is permutation equivariant, we use the definition of partial derivatives. For any Z = [z_1, ..., z_K]^⊤ and permutation π, we have

[∇g(PZ)]_i = lim_{δ→0} [ g(z_{π(1)}, ..., z_{π(i)} + δ, ..., z_{π(K)}) − g(z_{π(1)}, ..., z_{π(i)}, ..., z_{π(K)}) ] / δ
           = lim_{δ→0} [ g(z_1, ..., z_{π(i)} + δ, ..., z_K) − g(z_1, ..., z_{π(i)}, ..., z_K) ] / δ   (12)
           = [∇g(Z)]_{π(i)},

where the second equality (Eq. 12) is due to the permutation invariance of g(·). To prove Proposition 2 for the proximal operator, we first define H* = prox_g(Z) = argmin_H g(H) + (1/2)‖H − Z‖_F^2 for some Z. For an arbitrary permutation matrix P ∈ Π(K), we have

prox_g(PZ) = argmin_H g(H) + (1/2)‖H − PZ‖_F^2
           = argmin_H g(H) + (1/2)‖P(P^⊤H − Z)‖_F^2
           = argmin_H g(H) + (1/2)‖P^⊤H − Z‖_F^2
           = argmin_H g(P^⊤H) + (1/2)‖P^⊤H − Z‖_F^2   (13)
           = P H*,

where Eq. 13 is due to the permutation invariance of g(·).
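The iteration in Eq. 10-11 can be sanity-checked numerically. The sketch below runs the Q/h updates with toy closed-form proximal operators for potentials of our own choosing (f(h; x) = (h − x)^2 and g(R) = (γ/2)‖R‖^2, neither taken from the paper's experiments), on a single hyperedge covering three nodes:

```python
import numpy as np

# Toy instance of the ADMM / Douglas-Rachford iteration (Eq. 10-11).
eta, gamma = 1.0, 1.0
x = np.array([1.0, 2.0, 3.0])   # node attributes; one hyperedge e = {0, 1, 2}
d = np.ones(3)                  # node degrees (each node lies in one hyperedge)
H = x.copy()                    # H^(0) = X
Q = H.copy()                    # auxiliary edge variable Q_e

def prox_g(z):
    # prox_{eta*g}(z) for g(R) = (gamma/2) ||R||^2 has the closed form z / (1 + eta*gamma)
    return z / (1.0 + eta * gamma)

def prox_f(z, x, d):
    # prox_{eta*f(.;x)/d}(z) for f(h; x) = (h - x)^2, solved from the optimality condition
    return (z + 2.0 * eta * x / d) / (1.0 + 2.0 * eta / d)

for _ in range(300):
    Q = prox_g(2.0 * H - Q) + Q - H   # edge update (Eq. 10)
    H = prox_f(Q / d, x, d)           # node update (Eq. 11); here sum over e:v in e is just Q_v

# The fixed point solves min_H sum_v (h_v - x_v)^2 + (gamma/2) ||H||^2, i.e. h_v = 2 x_v / 3.
```

The assertion below checks that the iterate indeed converges to the analytic minimizer of the composite objective.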

B.2 PROOF OF THEOREM 1

Proof. To prove Theorem 1, we first summarize one of our key results in the following Lemma 2.

Lemma 2. ψ(·) : [0,1]^K → R^K is a permutation-equivariant function if and only if there is a function ρ(·) : [0,1]^K → R that is permutation invariant in its last K−1 entries, such that [ψ(Z)]_i = ρ(z_i, z_{i+1}, ..., z_K, z_1, ..., z_{i−1}) for any i.

Proof. (Sufficiency) Define π : [K] → [K] to be the index mapping associated with the permutation matrix P ∈ Π(K) such that PZ = [z_{π(1)}, ..., z_{π(K)}]^⊤. Then [ψ(z_{π(1)}, ..., z_{π(K)})]_i = ρ(z_{π(i)}, z_{π(i+1)}, ..., z_{π(K)}, z_{π(1)}, ..., z_{π(i−1)}). Since ρ(·) is invariant in its last K−1 entries, [ψ(PZ)]_i = ρ(z_{π(i)}, z_{π(i)+1}, ..., z_K, z_1, ..., z_{π(i)−1}) = [ψ(Z)]_{π(i)}.

(Necessity) Given a permutation-equivariant function ψ : [0,1]^K → R^K, we first expand it in the form [ψ(Z)]_i = ρ_i(z_1, ..., z_K). Permutation equivariance means ρ_{π(i)}(z_1, ..., z_K) = ρ_i(z_{π(1)}, ..., z_{π(K)}). Given an index i, consider any permutation π : [K] → [K] with π(i) = i. Then we have ρ_i(z_1, ..., z_i, ..., z_K) = ρ_{π(i)}(z_1, ..., z_i, ..., z_K) = ρ_i(z_{π(1)}, ..., z_i, ..., z_{π(K)}), which implies that ρ_i : R^K → R must be invariant in the K−1 elements other than the i-th element. Now, consider a permutation π with π(1) = i. Then ρ_i(z_1, z_2, ..., z_K) = ρ_{π(1)}(z_1, z_2, ..., z_K) = ρ_1(z_{π(1)}, z_{π(2)}, ..., z_{π(K)}) = ρ_1(z_i, z_{i+1}, ..., z_K, z_1, ..., z_{i−1}), where the last equality is due to our previous argument. This implies two results. First, for all i ∈ [K], ρ_i(z_1, z_2, ..., z_i, ..., z_K) can be written in terms of ρ_1(z_i, z_{i+1}, ..., z_K, z_1, ..., z_{i−1}). Moreover, ρ_1 is permutation invariant in its last K−1 entries. Therefore, we just need to set ρ = ρ_1 and broadcast it accordingly to all entries. We conclude the proof.
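Lemma 2 can be checked numerically: broadcasting any ρ that is invariant in its last K−1 arguments yields an equivariant ψ. The particular ρ below is an arbitrary stand-in of our own choosing:

```python
import numpy as np

# Numerical check of Lemma 2: [psi(Z)]_i = rho(z_i, z_{i+1}, ..., z_K, z_1, ..., z_{i-1})
# with rho invariant in its last K-1 arguments is permutation equivariant.
def rho(zi, rest):
    # invariant in the order of `rest` because it only uses a symmetric reduction
    return zi ** 2 + np.sum(np.sin(rest))

def psi(Z):
    K = len(Z)
    return np.array([rho(Z[i], np.concatenate([Z[i + 1:], Z[:i]])) for i in range(K)])

rng = np.random.default_rng(0)
Z = rng.random(6)
perm = rng.permutation(6)
# Equivariance: [psi(P Z)]_i == [psi(Z)]_{pi(i)}
equivariant = np.allclose(psi(Z[perm]), psi(Z)[perm])
```

Any permutation of the input simply permutes the output entries in the same way, which is exactly the equivariance property the lemma characterizes.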
To proceed with the proof, we bring in the following mathematical tools (Zaheer et al., 2017):

Definition 5. Given a vector z = [z_1, ..., z_K]^⊤ ∈ R^K, we define the power mapping ϕ_M : R → R^M as ϕ_M(z) = [z, z^2, ..., z^M]^⊤, and the sum-of-power mapping Φ_M : R^K → R^M as Φ_M(z) = Σ_{i=1}^K ϕ_M(z_i), where M is the largest degree.

Lemma 3. Let X = {[z_1, ..., z_K]^⊤ ∈ [0,1]^K : z_1 < z_2 < ... < z_K}. Define the mapping φ : R → R^{K+1} as φ(z) = [z^0, z^1, z^2, ..., z^K]^⊤, and the mapping Φ : R^K → R^{K+1} as Φ(z) = Σ_{i=1}^K φ(z_i). Then Φ restricted to X, i.e., Φ : X → R^{K+1}, is a homeomorphism.

Proof. Proved in Lemma 6 of (Zaheer et al., 2017).

We note that Definition 5 is slightly different from the mappings defined in Lemma 3 of (Zaheer et al., 2017) as it removes the constant (zero-order) term. Combining Lemma 3 (Zaheer et al., 2017) with results in (Wagstaff et al., 2019), we have the following result:

Lemma 4. Let X = {[z_1, ..., z_K]^⊤ ∈ [0,1]^K : z_1 < z_2 < ... < z_K}. Then, for M ≥ K, there exists a homeomorphism Φ_M : X → R^M such that Φ_M(z) = Σ_{i=1}^K ϕ_M(z_i), where ϕ_M : R → R^M.

Proof. For M = K, we choose ϕ_K and Φ_K to be the power mapping and sum-of-power mapping with largest degree K defined in Definition 5. We note that Φ(z) = [K, Φ_K(z)^⊤]^⊤. Since K is a constant, there exists a homeomorphism between the images of Φ(z) and Φ_K(z). By Lemma 3, Φ : X → R^{K+1} is a homeomorphism, which implies Φ_K : X → R^K is also a homeomorphism.

For M > K, we first pad every input z ∈ X with a constant k > 1 to obtain an M-dimensional ẑ ∈ R^M. Note that padding is homeomorphic since k is a constant. All such ẑ form a subset X′ ⊂ {[z_1, ..., z_M]^⊤ ∈ [0,k]^M : z_1 < z_2 < ... < z_M}. We choose φ̂_M : R → R^M to be the power mapping and Φ̂_M : X′ → R^M to be the sum-of-power mapping restricted to X′, respectively.
Following (Wagstaff et al., 2019), we construct Φ_M(z) as below:

Φ̂_M(ẑ) = Σ_{i=1}^K φ̂_M(z_i) + Σ_{i=K+1}^M φ̂_M(k)
        = Σ_{i=1}^K φ̂_M(z_i) + Σ_{i=1}^M φ̂_M(k) − Σ_{i=1}^K φ̂_M(k)   (14)
        = Σ_{i=1}^K (φ̂_M(z_i) − φ̂_M(k)) + M φ̂_M(k),

which gives Σ_{i=1}^K (φ̂_M(z_i) − φ̂_M(k)) = Φ̂_M(ẑ) − M φ̂_M(k). Let ϕ_M(z) = φ̂_M(z) − φ̂_M(k) and Φ_M(z) = Σ_{i=1}^K ϕ_M(z_i). Since [0,k] is naturally homeomorphic to [0,1], by our argument for M = K, Φ̂_M : X′ → R^M is a homeomorphism. This implies Φ_M : X → R^M is also a homeomorphism.

It is straightforward to show the sufficiency of Theorem 1 by verifying that, for an arbitrary permutation π : [K] → [K],

[ψ(z_{π(1)}, ..., z_{π(K)})]_i = ρ(z_{π(i)}, Σ_{j=1}^K ϕ(z_j)) = [ψ(z_1, ..., z_K)]_{π(i)}.

With Lemmas 2 and 4, we can conclude the necessity of Theorem 1 by the following construction:

1. By Lemma 2, any permutation-equivariant function ψ(z_1, ..., z_K) can be written as [ψ(·)]_i = τ(z_i, z_{i+1}, ..., z_K, z_1, ..., z_{i−1}) such that τ(·) is invariant in its last K−1 elements.

2. By Lemma 4, we know there exists a homeomorphism Φ_K which is continuous, invertible, and has a continuous inverse. For arbitrary z ∈ [0,1]^K, the difference between z and Φ_K^{−1} ∘ Φ_K(z) is up to a permutation.

3. Since τ(·) is permutation invariant in its last K−1 elements, we can construct τ(z_i, z_{i+1}, ..., z_K, z_1, ..., z_{i−1}) = τ(z_i, Φ_{K−1}^{−1} ∘ Φ_{K−1}(z_{i+1}, ..., z_K, z_1, ..., z_{i−1})) = ρ̂(z_i, Σ_{j≠i} ϕ_{K−1}(z_j)) = ρ(z_i, Σ_{j=1}^K ϕ_{K−1}(z_j)), where ρ̂(x, y) = τ(x, Φ_{K−1}^{−1}(y)) and ρ(x, y) = ρ̂(x, y − ϕ_{K−1}(x)). Since τ, Φ_{K−1}^{−1}, and ϕ_{K−1} are all continuous functions, their composition ρ is also continuous.

Remark 1. Sannai et al. (2019, Corollary 3.2) showed results similar to our Theorem 1. However, their exact form of the equivariant set function is written as ρ(z_i, Σ_{j≠i} ϕ(z_j)).
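The invertibility of the sum-of-power mapping underlying Lemma 4 can be demonstrated directly: from the K power sums one can recover the elementary symmetric polynomials via Newton's identities and then the entries as polynomial roots. A small sketch of our own, not code from the paper:

```python
import numpy as np

# Invert the sum-of-power map Phi_K(z) = sum_i [z_i, z_i^2, ..., z_i^K] on a multiset.
def power_sums(z, K):
    return np.array([np.sum(z ** m) for m in range(1, K + 1)])

def invert_power_sums(p):
    K = len(p)
    e = [1.0]                          # elementary symmetric polynomials, e_0 = 1
    for m in range(1, K + 1):
        # Newton's identity: m * e_m = sum_{i=1}^m (-1)^{i-1} e_{m-i} p_i
        e.append(sum((-1) ** (i - 1) * e[m - i] * p[i - 1] for i in range(1, m + 1)) / m)
    # Monic polynomial with roots z_i: sum_m (-1)^m e_m x^{K-m}
    coeffs = [(-1) ** m * e[m] for m in range(K + 1)]
    return np.sort(np.roots(coeffs).real)

z = np.array([0.1, 0.4, 0.7, 0.9])
recovered = invert_power_sums(power_sums(z, 4))
```

Up to ordering, the entries are recovered exactly, which is why the pooled representation in Theorem 1 loses no information about the multiset.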
We note the difference: the summation in Theorem 1 pools over all the nodes inside a hyperedge, while the summation in (Sannai et al., 2019, Corollary 3.2) pools over the elements other than the central node. To implement the latter form in hypergraphs, which have coupled sets, one has to maintain each node-hyperedge pair, which is computationally prohibitive in practice. An easier proof can be obtained by combining the results in (Sannai et al., 2019) with our proof technique introduced in Step 3.

C ALIGNING ALGORITHM 1 WITH GD/ADMM UPDATES

ED-HNN can be regarded as an unfolding algorithm to optimize objective Eq. 1, where f and g are node and edge potentials, respectively. Essentially, each iteration of ED-HNN simulates a one-step equivariant hypergraph diffusion. Each iteration of ED-HNN has four steps. In the first step, ED-HNN point-wise transforms the node features via an MLP ϕ. Next, ED-HNN sum-pools the node features onto the associated hyperedges as m_e = Σ_{v∈e} ϕ(h_v). The third step is crucial to achieve equivariance, where each node sum-pools the hyperedge features after interacting its own feature with every connected hyperedge: Σ_{e:v∈e} ρ(h_v, m_e). The last step updates the feature information via another node-wise transformation φ̂. According to our Theorem 1, ρ(h_v, Σ_{u∈e} ϕ(h_u)) can represent any equivariant function. Therefore, the messages aggregated from hyperedges ρ(h_v, m_e) can be learned to approximate the gradient ∇g in Eq. 2. ED-HNN can thus well approximate the GD algorithm (Eq. 2), while ED-HNN does not perfectly match ADMM unless we assume Q_e^{(t)} = H_e^{(t)} in Eq. 3 for practical consideration. This assumption may reduce the performance of the hypergraph diffusion algorithm, while our model ED-HNN has already achieved good enough empirical performance.
Tracking Q_e^{(t)} means recording the messages from E to V at every iteration, which is not supported by current GNN platforms (Fey & Lenssen, 2019; Wang et al., 2019) and may consume more memory. We leave the study of algorithms that can track the update of Q_e^{(t)} for future work. A co-design of the algorithm and the system may be needed to guarantee the scalability of such algorithms.
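For concreteness, the four steps of an ED-HNN layer described in Appendix C can be sketched in plain numpy as below. The stand-in functions phi, rho, and phi_hat replace the learned MLPs, and the tiny hypergraph is hypothetical:

```python
import numpy as np

# One ED-HNN layer on a toy hypergraph; phi/rho/phi_hat are illustrative
# stand-ins for the learned MLPs, not the paper's trained networks.
hyperedges = [[0, 1, 2], [1, 2, 3]]
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # node attributes x_v
H = X.copy()                                  # H^(0) = X
deg = np.array([1.0, 2.0, 2.0, 1.0])          # node degrees d_v

phi     = lambda h: np.tanh(h)                          # step 1: node-wise transform
rho     = lambda h_v, m_e: np.tanh(h_v + m_e)           # step 3: equivariant E -> V message
phi_hat = lambda h, agg, x, d: 0.5 * (h + agg / d + x)  # step 4: node update

# step 2: sum-pool V -> E messages onto each hyperedge
m_e = {tuple(e): sum(phi(H[u]) for u in e) for e in hyperedges}

# step 3: each node sum-pools rho(h_v, m_e) over its incident hyperedges
agg = np.zeros_like(H)
for e in hyperedges:
    for v in e:
        agg[v] += rho(H[v], m_e[tuple(e)])

# step 4: node-wise update with attributes and degrees
H_next = np.array([phi_hat(H[v], agg[v], X[v], deg[v]) for v in range(len(H))])
```

Because ρ takes the receiving node's own feature as input, each node in a hyperedge receives a different message, which is what makes the operator equivariant rather than invariant.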

D INVARIANT DIFFUSION VERSUS EQUIVARIANT DIFFUSION

Although our Proposition 2 states that the hypergraph diffusion operator should be inherently equivariant, many existing frameworks design it to be invariant. In invariant diffusion, the hypergraph diffusion operator ∇g(·) or prox_{ηg}(·) corresponds to a permutation-invariant function rather than a permutation-equivariant function (see Definition 4). Mathematically, ∇g(H_e) = 1 ĝ(H_e)^⊤ for some invariant function ĝ : R^{|e|×F} → R^F. In contrast to equivariant diffusion, the invariant diffusion operator summarizes the node information into one feature vector and passes this identical message to all nodes uniformly. We provide Fig. 4 to illustrate the key difference between invariant diffusion and equivariant diffusion. Implementing such an invariant diffusion with DeepSet (Zaheer et al., 2017) yields the following message passing:

Example: Invariant Diffusion by DeepSet
Initialization: H^{(0)} = X. For t = 0, 1, 2, ..., L−1, do:
1. Design the messages from V to E: m_{u→e}^{(t)} = φ(h_u^{(t)}), for all u ∈ e, e ∈ E.
2. Sum the V → E messages over hyperedges: m_e^{(t)} = Σ_{u∈e} m_{u→e}^{(t)}, for all e ∈ E.
3. Broadcast m_e^{(t)} and design the messages from E to V: m_{e→v}^{(t)} = ρ(m_e^{(t)}), for all v ∈ e.
4. Update h_v^{(t+1)} = φ̂(h_v^{(t)}, Σ_{e:v∈e} m_{e→v}^{(t)}, x_v, d_v), for all v ∈ V.

The difference in the third step is worth noting: ρ is independent of the target node features, so each hyperedge passes every node an identical message. HyperGCN (Yadati et al., 2019), HGNN (Feng et al., 2019), and other GCNs running on the clique expansion can all be reduced to invariant diffusion.
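The contrast between the two E → V message designs fits in a few lines; the functions are again illustrative stand-ins for learned MLPs:

```python
import numpy as np

# Invariant vs. equivariant E -> V messages on a single hyperedge e = {0, 1, 2}.
h = np.array([1.0, 2.0, 3.0])     # node features inside the hyperedge
m_e = np.sum(np.tanh(h))          # pooled hyperedge feature (sum of node transforms)

# Invariant design: rho(m_e) ignores the receiver, so all nodes get one message.
invariant_msgs = [np.tanh(m_e) for v in range(3)]

# Equivariant design: rho(h_v, m_e) conditions on the receiving node's feature.
equivariant_msgs = [np.tanh(h[v] + m_e) for v in range(3)]

all_identical = len(set(invariant_msgs)) == 1
all_distinct = len(set(equivariant_msgs)) == 3
```

The invariant operator necessarily broadcasts one shared vector per hyperedge, while the equivariant operator can move different nodes in different directions, which is what gradient steps of general edge potentials require.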

E FURTHER DISCUSSION ON OTHER RELATED WORKS

Our ED-HNN is inspired by a line of works on optimization-inspired NNs. Optimization-inspired NNs often get praised for their interpretability (Monga et al., 2021; Chen et al., 2021b) and certified convergence under certain conditions on the learned operators (Ryu et al., 2019; Teodoro et al., 2017; Chan et al., 2016). Optimization-inspired NNs mainly focus on applications such as compressive sensing (Gregor & LeCun, 2010; Xin et al., 2016; Liu & Chen, 2019; Chen et al., 2018), speech processing (Hershey et al., 2014; Wang et al., 2018), image denoising (Zhang et al., 2017b; Meinhardt et al., 2017; Chang et al., 2017; Chen & Pock, 2016), partitioning (Zheng et al., 2015; Liu et al., 2017), deblurring (Schuler et al., 2015; Li et al., 2020b), and so on. Only a few works have recently applied optimization-inspired NNs to graphs, as discussed in Sec. 3.4, while our work is the first to study optimization-inspired NNs on hypergraphs.

Parallel to our hypergraph-diffusion-inspired HNNs, there are also models that represent diffusion on graphs. However, as graphs contain only pairwise relations, the model design is often much simpler than that for hypergraphs. In particular, constructing an expressive operator for pairwise relations is trivial (corresponding to the case K = 2 in Theorem 1). These models unroll either learnable (Yang et al., 2021a; Chen et al., 2021a), fixed ℓ1-norm (Liu et al., 2021c), or fixed ℓ2-norm pairwise potentials (Klicpera et al., 2019) to formulate GNNs. Some works also view graph diffusion as an ODE (Chamberlain et al., 2021; Thorpe et al., 2022), which essentially corresponds to the gradient descent formulation Eq. 2. Implicit GNNs (Dai et al., 2018; Liu et al., 2021a; Gu et al., 2020; Yang et al., 2021b) directly parameterize the optimum of an optimization problem, while implicit methods are still missing for hypergraphs, which is a promising future direction.

F ADDITIONAL EXPERIMENTS AND IMPLEMENTATION DETAILS

F.1 EXPERIMENTS ON ED-HNNII

As described in Sec. 3.2, an extension to our ED-HNN is to treat the V → E and E → V message passing as two equivariant set functions. The detailed algorithm is illustrated in Algorithm 2, where the red box highlights the major difference from ED-HNN.

Algorithm 2: ED-HNNII
Initialization: H^{(0)} = X and three MLPs φ, ρ, φ̂ (shared across L layers). For t = 0, 1, 2, ..., L−1, do:
1. Design the messages from V to E: m_{u→e}^{(t)} = φ(m_{u→e}^{(t−1)}, h_u^{(t)}), for all u ∈ V.
2. Sum the V → E messages over hyperedges: m_e^{(t)} = Σ_{u∈e} m_{u→e}^{(t)}, for all e ∈ E.
3. Broadcast m_e^{(t)} and design the messages from E to V: m_{e→v}^{(t)} = ρ(h_v^{(t)}, m_e^{(t)}), for all v ∈ e.
4. Update h_v^{(t+1)} = φ̂(h_v^{(t)}, Σ_{e:v∈e} m_{e→v}^{(t)}, x_v, d_v), for all v ∈ V.

In our tested datasets, hyperedges do not have initial attributes. In our implementation, we assign a common learnable vector to every hyperedge as its first-layer feature m_{u→e}^{(0)}. The performance of ED-HNNII and the comparison with other baselines are presented in Table 4. Our finding is that ED-HNNII can outperform ED-HNN on datasets with relatively larger scale or more heterophily, and the average accuracy improvement by ED-HNNII is around 0.5%. We argue that ED-HNNII inherently has a more complex computational mechanism and thus tends to overfit on small datasets (e.g., Cora, Citeseer, etc.). Moreover, the superior performance on heterophilic datasets also implies that injecting more equivariance benefits handling heterophilic data. From Table 4, the measured computational efficiency is also comparable to ED-HNN and the other baselines.

F.2 DATASET DETAILS

Our benchmarks include existing datasets and the two newly introduced Congress (Fowler, 2006a) and Senate (Fowler, 2006b). For existing datasets, we downloaded the processed version by Chien et al. (2022). For co-citation networks (Cora, Citeseer, Pubmed), all documents cited by a document are connected by a hyperedge (Yadati et al., 2019). For co-authorship networks (Cora-CA, DBLP), all documents co-authored by an author are in one hyperedge (Yadati et al., 2019). The node features in citation networks are the bag-of-words representations of the corresponding documents, and node labels are the paper classes. In the House dataset, each node is a member of the US House of Representatives and hyperedges group members of the same committee; node labels indicate the political party of the representatives (Chien et al., 2022). In Walmart, nodes represent products purchased at Walmart, hyperedges represent sets of products purchased together, and the node labels are the product categories (Chien et al., 2022).
For Congress (Fowler, 2006a) and Senate (Fowler, 2006b), we used the same setting as (Veldt et al., 2022). In the Congress dataset, nodes are US Congresspersons and hyperedges are comprised of the sponsor and co-sponsors of legislative bills put forth in both the House of Representatives and the Senate. In the Senate dataset, nodes are US Congresspersons and hyperedges are comprised of the sponsor and co-sponsors of bills put forth in the Senate. Each node in both datasets is labeled with its political party affiliation. Both datasets were derived from James Fowler's data (Fowler, 2006b;a). We also list more detailed statistics of the tested datasets in Table 5.

F.3 HYPERPARAMETERS FOR BENCHMARKING DATASETS

For a fair comparison, we use the same training recipe for all models. For baseline models, we precisely follow the hyperparameter settings from (Chien et al., 2022). For ED-HNN, we adopt the Adam optimizer with a fixed learning rate of 0.001 and weight decay of 0.0, and train for 500 epochs on all datasets. The standard deviation is reported by repeating experiments on ten different data splits. We fix the input dropout rate to 0.2 and the dropout rate to 0.

The initial node features H^{(0)} are sampled from 1-dim Gaussian distributions with mean 0 and variance randomly drawn between 1 and 100. That is, to generate a single instance of H^{(0)}, we first pick σ uniformly from [1, 10], and then sample the coordinate entries as h_v^{(0)} ∼ N(0, σ^2). Then we apply the gradient step in Eq. 2 to obtain the corresponding H^{(1)}. For non-differentiable node or edge potentials, we adopt subgradients for convenient computation. We fix the node potential as f(h_v; x_v) = (h_v − x_v)^2, where x_v ≡ h_v^{(0)}. We consider 3 different edge potentials from Example 1 with varying complexities: a) CE, b) TV (p = 2), and c) LEC (p = 2). For LEC, we set y as follows: if |e| is even, then y_i = 2/|e| if i ≤ |e|/2 and y_i = −2/|e| otherwise; if |e| is odd, then y_i = 2/(|e|−1) if i ≤ (|e|−1)/2, y_i = 0 if i = (|e|+1)/2, and y_i = −2/(|e|−1) otherwise. To apply the gradient step in Eq. 2, we need to specify the learning rate η. We choose η such that Var(H^{(1)})/Var(H^{(0)}) does not vary too much among the three edge potentials. Specifically, we set η = 0.5 for CE, η = 0.02 for TV, and η = 0.1 for LEC.

Beyond the semi-synthetic diffusion data generated from the gradient step Eq. 2, we also consider synthetic diffusion data obtained from the proximal operators in Eq. 3 and 4. We generate a random uniform hypergraph with 1,000 nodes and 1,000 hyperedges of constant hyperedge size 20. The diffusion data on this hypergraph consists of 1,000 pairs (H^{(0)}, H^{(1)}).
The initial node features H^{(0)} are sampled in the same way as before. We apply the updates given by Eq. 3 and 4 to obtain H^{(1)}. We consider the same node potential and two edge potentials, TV (p = 2) and LEC (p = 2). We set η = 1/2 for both cases. We show the results in Figure 5. The additional results resonate with our previous results in Figure 3: again, our ED-HNN outperforms the other baseline HNNs by a significant margin when the hidden dimension is limited.
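The piecewise definition of the LEC target vector y can be written as a small helper (0-indexed here, whereas the text uses 1-indexed i):

```python
# Construct the y vector used by the LEC edge potential for a hyperedge of a given size.
def lec_y(size):
    if size % 2 == 0:
        half = size // 2
        # first half +2/|e|, second half -2/|e|
        return [2.0 / size] * half + [-2.0 / size] * half
    half = (size - 1) // 2
    # odd case: +2/(|e|-1) on the first half, 0 in the middle, -2/(|e|-1) on the rest
    return [2.0 / (size - 1)] * half + [0.0] + [-2.0 / (size - 1)] * half
```

By construction the entries of y always sum to zero, so the potential compares a sorted feature profile against a balanced contrast vector.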

F.6 MORE COMPLEXITY ANALYSIS

In this section, we provide more efficiency comparisons with HyperSAGE (Arya et al., 2020) and LEGCN (Yang et al., 2020). As discussed in Sec. 3.4, HyperSAGE and LEGCN may be able to learn a hyperedge-equivariant operator. However, both need to build node-hyperedge pairs, which is memory-consuming and unfriendly to efficient message passing implementations. To demonstrate this point, we compare our model with HyperSAGE and LEGCN in terms of training and inference efficiency. We reuse the official code provided on OpenReview to benchmark HyperSAGE. Since LEGCN has not released code, we reimplement it using the PyTorch Geometric library. We can only test the speed of these models on the Cora dataset, because HyperSAGE and LEGCN cannot scale up: there is no efficient implementation yet for their computationally prohibitive preprocessing procedures. The results are presented in Table 8. We note that HyperSAGE cannot easily employ message passing between V and E, since the neighbor aggregation step in HyperSAGE needs to rule out the central node. Its official implementation adopts a naive "for"-loop in each forward pass, which cannot fully utilize GPU parallelism and significantly deteriorates the speed. LEGCN can be implemented via message passing; however, the graphs expanded over node-edge pairs are much denser and thus cannot scale up to larger datasets.



Figure 1: Hypergraph diffusion often uses permutation-invariant hyperedge potentials to model higher-order

m_e^{(t)} needs dim = Ω(|∪_{e:v∈e} e|) so that the sum pooling Σ_{e:v∈e} m_e^{(t)} does not lose anything from the neighboring nodes of v before interacting with h_v^{(t)}, while in ED-HNN, m_e^{(t)} only needs dim = |e| according to Theorem 1. Our empirical experiments also verify the greater expressiveness of ED-HNN.

4. EXPERIMENTS

4.1. RESULTS ON BENCHMARKING DATASETS

Experiment Setting. In this subsection, we evaluate ED-HNN on nine real-world benchmarking hypergraphs. We focus on the semi-supervised node classification task. The nine datasets include co-citation networks (Cora, Citeseer, Pubmed), co-authorship networks (Cora-CA, DBLP-CA) Yadati et al. (2019), Walmart Amburg et al. (2020), House Chodrow et al. (

Figure 3: Comparing the Powers to Represent Known Diffusion: MAE vs. Latent Dimensions.

Figure 4: Comparing Invariant Diffusion and Equivariant Diffusion. The direction of the arrows indicates the information flow. Arrows of the same color denote the same passed message, while arrows of different colors denote different messages.


Figure 5: Comparing the Powers to Represent Known Diffusion (using ADMM with the proximal operators in Eq. 3 and 4): MAE vs. Latent Dimensions.

The feature vector H = [..., h_v, ...]^⊤_{v∈V} ∈ R^N includes node features as entries. Further, collect the features associated with edge e into a hyperedge feature vector H_e = [h_{e,1}, ..., h_{e,|e|}]^⊤ ∈ R^{|e|}, where h_{e,i} corresponds to the feature h_v of the i-th node in e: for any v ∈ e, there is one corresponding index i ∈ {1, ..., |e|}. Later, we may use the subscripts (e, i) and (e, v) interchangeably if they cause no confusion.

Dataset statistics. CE homophily is the homophily score (Pei et al., 2020) based on the CE of hypergraphs. Here, four hypergraphs (Congress, Senate, Walmart, House) are heterophilic.

Prediction Accuracy (%). Bold font† highlights when ED-HNN significantly (difference in means > 0.5 × std) outperforms all baselines. The best baselines are underlined. The training and testing times are measured on Walmart using the same server with one NVIDIA RTX A6000 GPU.

Prediction Accuracy (%) over Synthetic Hypergraphs with Controlled Heterophily α.

Additional experiments and updated leaderboard with ED-HNNII. Prediction accuracy (%). Bold font highlights when ED-HNNII outperforms the original ED-HNN. Other details are kept consistent with Table 2.

More dataset statistics. CE homophily is the homophily score (Pei et al., 2020) based on the CE of hypergraphs.

ACKNOWLEDGMENTS

We would like to express our deepest appreciation to Dr. David Gleich and Dr. Kimon Fountoulakis for the insightful discussions on hypergraph computation, and to Dr. Eli Chien for the constructive advice on the experiments.


We apply LayerNorm in each layer, similar to (Chien et al., 2022). Other parameters regarding model sizes are obtained by grid search and enumerated in Table 6. The search range of the layer number is {1, 2, 4, 6, 8} and that of the hidden dimension is {96, 128, 256, 512}. We find that the model size is proportional to the dataset scale, and in general heterophilic data need deeper architectures. For ED-HNNII, due to the inherent model complexity, we need to prune the model depth and width to fit each dataset.

Algorithm 3: Contextual Hypergraph Stochastic Block Model
Initialization: Empty hyperedge set E = ∅. Draw vertex set V_1 of 2,500 nodes with class 1. Draw vertex set V_2 of 2,500 nodes with class 2. For i = 1, 2, ..., 1,000, do:
1. Sample a subset e_1 of α_1 nodes from V_1 without replacement.
2. Sample a subset e_2 of α_2 nodes from V_2 without replacement.
3. Add the hyperedge e = e_1 ∪ e_2 to E.

F.4 SYNTHETIC HETEROPHILIC DATASETS

We use the contextual hypergraph stochastic block model (Deshpande et al., 2018; Ghoshdastidar & Dukkipati, 2014; Lin & Wang, 2018) to synthesize data with controlled heterophily. The generated hypergraph contains 5,000 nodes and two classes in total, with 2,500 nodes per class. We construct hyperedges by randomly sampling α_1 nodes from class 1 and α_2 nodes from class 2 without replacement. Each hyperedge has a fixed cardinality |e| = α_1 + α_2 = 15. We draw 1,000 hyperedges in total. The detailed data synthesis pipeline is summarized in Algorithm 3. We use α = min{α_1, α_2} to characterize the heterophily level of the hypergraph. For a more intuitive illustration, we list the CE homophily corresponding to different α in Table 7. Experiments on the synthetic heterophilic datasets fix the training hyperparameters and the hidden dimension to 256 to guarantee a fair parameter budget (~1M). Baseline HNNs all use one-layer architectures, as they do not scale well with depth, as shown in Sec. 4.3. Since our ED-HNN adopts a parameter-sharing scheme, we can easily repeat the diffusion layer twice to achieve better results without parameter overheads.
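A minimal sketch of this generation pipeline (Algorithm 3) follows; it assumes, for simplicity, that class-1 nodes form the α-sized minority side of every hyperedge, which is one simple way to realize α = min{α_1, α_2}:

```python
import random

# Sketch of the contextual hypergraph SBM used for the heterophily experiments.
def synth_hypergraph(alpha, n_per_class=2500, n_edges=1000, edge_size=15, seed=0):
    rng = random.Random(seed)
    V1 = list(range(n_per_class))                   # class-1 node ids
    V2 = list(range(n_per_class, 2 * n_per_class))  # class-2 node ids
    a1, a2 = alpha, edge_size - alpha               # per-edge class counts, a1 + a2 = 15
    edges = []
    for _ in range(n_edges):
        # sample without replacement within each class, then merge
        e = rng.sample(V1, a1) + rng.sample(V2, a2)
        edges.append(e)
    return edges

edges = synth_hypergraph(alpha=4)
```

Higher α makes each hyperedge more class-balanced and thus the hypergraph more heterophilic, matching the α-controlled regimes studied in Sec. 4.2.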

F.5 SYNTHETIC DIFFUSION DATASETS AND ADDITIONAL EXPERIMENTS

In order to evaluate the ability of ED-HNN to express a given hypergraph diffusion, we generate semi-synthetic diffusion data using the Senate hypergraph (Chodrow et al., 2021) and synthetic node features.

Published as a conference paper at ICLR 2023

Table 8: Performance and Efficiency Comparison with HyperSAGE and LEGCN. The prediction accuracy of HyperSAGE is copied from the original manuscript (Arya et al., 2020).

We conduct expressivity vs. hidden-size experiments in Tab. 9, where we control the hidden dimension of the hypergraph models and test their performance on the Pubmed dataset. We choose the top performers AllDeepSets and AllSetTransformer as the compared baselines. On real-world data, our 64-width ED-HNN can achieve on-par results with 512-width AllSet models. This implies that our ED-HNN has a better tolerance to low hidden dimensions, which we attribute to the higher expressive power achieved by equivariance.

