HIGHER-ORDER STRUCTURE PREDICTION IN EVOLVING GRAPH SIMPLICIAL COMPLEXES

Anonymous

Abstract

Dynamic graphs are rife with higher-order interactions, such as co-authorship relationships and protein-protein interactions in biological networks, that naturally arise between more than two nodes at once. In spite of the ubiquitous presence of such higher-order interactions, limited attention has been paid to the higher-order counterpart of the popular pairwise link prediction problem. Existing higher-order structure prediction methods are mostly based on heuristic feature extraction procedures, which work well in practice but lack theoretical guarantees. Such heuristics are primarily focused on predicting links in a static snapshot of the graph. Moreover, these heuristic-based methods fail to effectively utilize and benefit from the knowledge of latent substructures already present within the higher-order structures. In this paper, we overcome these obstacles by capturing higher-order interactions succinctly as simplices, modeling their neighborhoods by face-vectors, and developing a nonparametric kernel estimator for simplices that views the evolving graph as a time process (i.e., a sequence of graph snapshots). Our method substantially outperforms several baseline higher-order prediction methods. As a theoretical achievement, we prove the consistency of our estimator and its asymptotic normality in terms of the Wasserstein distance, using Stein's method.

1. INTRODUCTION

Numerous types of networks, such as social (Liben-Nowell & Kleinberg, 2007a), biological (Airoldi et al., 2006), and chemical reaction networks (Wegscheider, 1911), are highly dynamic, as they evolve and grow rapidly via the appearance of new interactions, represented as the introduction of new links/edges between the nodes of a network. Identifying the underlying mechanisms by which such networks evolve over time is a fundamental question that is not yet fully understood. Typically, insight into the temporal evolution of networks has been obtained via a classical inferential problem called link prediction: given a snapshot of the network at time t along with its linkage pattern, the task is to assess whether a pair of nodes will be linked at a later time t′ > t. While inferring pairwise links is an important problem, most real-world graphs also exhibit higher-order, group-wise interactions that involve more than two nodes at once. Examples illustrating human group behavior include a co-authorship relationship on a single paper and a network of e-mails to multiple recipients. In nature, too, one can observe several proteins interacting simultaneously in a biological network. In spite of their significance, relatively few works have studied the problem of predicting higher-order group-wise interactions, in comparison to single-edge inference. Benson et al. (2018) originally introduced the simplex to model group-wise interactions between nodes in a graph. They proposed predicting a simplicial closure event, whereby an open simplex (with just pairwise interactions between member vertices) transitions, in the near future, to a closed simplex (where all member vertices participate in the higher-order relationship simultaneously). Figure 1 (Middle) shows an example of such a transition from an open triangle to a closed one.
Recently, several works have proposed modeling higher-order interactions as hyperedges in a hypergraph (Xu et al., 2013; Zhang et al., 2018; Yoon et al., 2020; Patil et al., 2020). Given a hyperedge h_t at time t, the inference task is to predict the future arrival of a new hyperedge h_{t′} that covers a larger set of vertices than h_t and contains all the vertices of h_t. Figure 1 (Right) illustrates this hyperedge prediction task. Although prediction models based on either simplicial closure events or hyperedge arrivals deal with higher-order structures, both fail to capture the highly complex and non-linear evolution of higher-order structures over time. Both kinds of models have two major limitations. First, they predict structures from a single static snapshot of the graph, thus not viewing the evolution process of adding new edges as a time process. Second, their feature extraction is mostly based on popular heuristics (Adamic & Adar, 2003; Brin & Page, 2012; Jeh & Widom, 2002; Zhou et al., 2009a; Barabási & Albert, 1999; Bhatia et al., 2019) that work well in practice but are not accompanied by strong theoretical guarantees. In addition, hypergraph-based methods model higher-order structures as hyperedges, which omit the lower-dimensional substructures present within a single hyperedge. As a consequence, they cannot distinguish between various substructure relationships. For example, the hyperedge [A, B, C] in Figure 1 (Right) cannot distinguish between group relationships like [[A, B], [B, C], [A, C]] (a set of pairwise interactions) versus [A, B, C] (A, B, and C all in one simultaneous relationship). This problem is remedied by the use of simplices, because they naturally model these substructures as a collection of subsets (i.e., faces) of the simplex. We next provide real-world examples where our simplicial complex based approach can play a significant role.
(i) Organic chemistry: It is quite common to have the same set of elements interacting with each other in different configurations, which results in compounds that function very differently (Ma et al., 2011). Specifically, R-thalidomide and S-thalidomide are two different configurations of thalidomide: the R-form was meant to help sedate pregnant women, while the S-form unfortunately resulted in birth defects. This is a famous example in stereochemistry of the consequences of mistaking two extremely close configurations (differing by a single bond) as being the same. Structure prediction that avoids such phenomena in drug synthesis allows chemists to achieve a much higher yield and avoid wasting expensive resources. (ii) Gene expression networks: Gene networks have nodes that represent genes, and edges connect genes with similar expression patterns (Zhang & Horvath, 2005). Subgraphs called modules are tightly connected genes in such a gene expression network. Genomics research provides evidence that higher-order gene expression relationships (such as second- and third-order) and their measurements can have very important implications for cancer prognosis. When making structural predictions in these examples, our simplicial complex based approach provides much finer-grained control than competing methods by capturing subtler differences in configurations. To combat these challenges, our approach views the evolving graph as a time process under the framework of nonparametric time series prediction, modeling the evolution of higher-order structures (as simplices) and their local neighborhoods (spatial dimension) over a moving time window (temporal dimension). Our inference problem is then modeled as predicting the evolution to a higher-dimensional simplex at time t′ > t, given a simplex at time t. It is important to note that this task is more general and greatly diverges from the task proposed by Benson et al. (2018).
Our task requires just a single simplex σ in order to predict a higher-dimensional simplex τ that has σ as a face/subset, whereas Benson et al. (2018) requires the presence of all constituent faces of σ in order to predict τ. For example, in Benson et al. (2018) (also illustrated in Figure 1, Middle), the triangle on {A, B, C} can be predicted only once all pairwise edges [A, B], [B, C], and [A, C] are present, whereas our method requires only a single face. To this effect, we succinctly capture the features characterizing the local neighborhood of a simplex as a combination of a face-vector (Björner & Kalai, 2006), a well-established vector signature in the combinatorial topology literature, and a novel scoring function that infers the affinity of sub-simplices based on the strength of their past interactions. Based on these features, we design a kernel estimator to infer future evolution to higher-dimensional simplices and prove both the consistency and asymptotic normality of our estimator. Our contributions: (a) We propose a kernel estimator that predicts higher-order interactions in an evolving network. (b) We prove the consistency and asymptotic normality of our kernel estimator. (c) We evaluate our method on real-world dynamic networks by proposing higher-order link prediction baselines, and we observe significant gains in prediction accuracy over the baselines.

1.1. RELATED STUDIES

Single link prediction: Most literature that predicts a single edge/link can be broadly classified as based on (i) heuristics, (ii) random walks, or (iii) graph neural networks (GNNs). (i) Heuristic methods comprise Common Neighbors, Adamic-Adar (Adamic & Adar, 2003), PageRank (Brin & Page, 2012), SimRank (Jeh & Widom, 2002), resource allocation (Zhou et al., 2009a), preferential attachment (Barabási & Albert, 1999), persistence homology based ranking (Bhatia et al., 2019), and similarity-based methods (Liben-Nowell & Kleinberg, 2007b; Lü & Zhou, 2011). (ii) Random walk based methods include DeepWalk (Perozzi et al., 2014), Node2Vec (Grover & Leskovec, 2016b), and SpectralWalk (Sharma et al., 2020). (iii) Finally, for both link prediction and node classification tasks, recent works are mainly GNN-based methods such as VGAE (Kipf & Welling, 2016), WYS (Abu-El-Haija et al., 2018), and SEAL (Zhang & Chen, 2018b). Higher-order link prediction: Benson et al. (2018) were the first to introduce a higher-order link prediction problem, studying the likelihoods of future higher-order group interactions as simplicial closure events (explained earlier). Furthermore, there are studies using hypergraphs, which also naturally represent group relations (Xu et al., 2013; Zhang et al., 2018; Yoon et al., 2020; Patil et al., 2020). In particular, to represent higher-order relationships, Yoon et al. (2020) proposed n-projected graphs. For larger n, i.e., higher-order groups, enumerating subsets and keeping track of node co-occurrences quickly becomes infeasible. In comparison to a hypergraph, our graph simplicial complex is closed under taking subsets, which enables us to encode more information for improved inference. We start with the general notion of an abstract simplicial complex (ASC), then define a simplex using ASCs. We specialize this definition to graphs and define a graph simplicial complex (GSC).

2. PRELIMINARY: GRAPH SIMPLICIAL COMPLEX (GSC)

Definition 1 (Abstract simplicial complex and simplex). An abstract simplicial complex (ASC) is a collection A of finite non-empty sets such that if σ is an element of A, then so is every non-empty subset of σ. An element σ of A is called a simplex of A; its dimension is one less than the number of its elements.

Now, we analyze graphs using the definition of ASCs. Let G = (V, E) be a finite graph with vertex set V and edge set E. A graph simplicial complex (GSC) G on G is an ASC consisting of subsets of V. In particular, G is a collection of subgraphs of G. For graphs, we denote a d-dimensional simplex (or d-simplex) of a GSC by σ^(d) = [v_0, v_1, . . . , v_d]. Each non-empty subset of σ^(d) is called a face of σ^(d). We define several notions related to GSCs that are useful for describing the evolution of graphs.

Definition 2 (Filtered GSC). For I ⊂ N, a filtered GSC indexed over I is a family (G_t)_{t∈I} of GSCs such that for every t ≤ t′ in I, G_t ⊂ G_{t′} holds.

Obviously, G_{t_0} ⊂ G_{t_1} ⊂ . . . ⊂ G_{t_n} is a discrete filtration induced by the arrival times of the simplices: G_{t_i} \ G_{t_{i−1}} = σ_{t_i}. This depicts a higher-order analogue of an evolving graph (incremental model), which allows attaching new simplices at each time step to an existing GSC to build a new GSC. A filtered GSC G_{t,p} over the last p discrete time steps is defined as G_{t,p} := (G_{t′})_{t′=t−p}^{t} = (G_{t−p}, . . . , G_t), where G_{t′} ⊃ G_{t′−1}. We now define a notion for dealing with the neighborhood around a given simplex σ^(d) ∈ G. We introduce the set of all simplices of dimension d′ or less from G, i.e., G^(d′)_− := {σ^(d) ∈ G | d ≤ d′}. Thus, G^(0)_− is the vertex set and G^(1)_− is the set of edges and vertices. We also write i ∼ j whenever vertices i and j are adjacent in G^(1)_−, and write i ∼_k j to indicate that vertex j is k-reachable from i, i.e., there exists a path of length at most k connecting i and j in G^(1)_−. Then, we define a ball around σ^(d).

Definition 3 (k-ball centered at a vertex and at a simplex).
At time t, we define the k-ball centered at vertex i as B_{t,k}(i) := {j : i ∼_k j and i, j ∈ G^(0)_{t,−}}. We also define the k-ball centered at a simplex σ^(d) as B_{t,k}(σ^(d)) := ∪_{i ∈ Vert(σ^(d))} B_{t,k}(i), where Vert(σ^(d)) denotes the set of vertices of σ^(d). Finally, we define the sub-complex G_t(σ^(d)) ⊆ G_t as the GSC that contains all the simplices in G_t spanned by the vertices in the k-ball B_{t,k}(σ^(d)).
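To make these definitions concrete, the following is a minimal sketch (hypothetical helper names, not the authors' implementation) that checks the downward-closure property of an ASC and computes k-balls by breadth-first search, assuming the 1-skeleton is given as an adjacency dict:

```python
from collections import deque
from itertools import combinations

def faces(simplex):
    """All non-empty subsets (faces) of a simplex."""
    s = tuple(sorted(simplex))
    return {frozenset(c) for r in range(1, len(s) + 1) for c in combinations(s, r)}

def is_asc(collection):
    """Definition 1: A is an ASC iff every non-empty subset of each
    element of A is also an element of A."""
    A = {frozenset(s) for s in collection}
    return all(f in A for s in A for f in faces(s))

def k_ball(adj, i, k):
    """Definition 3: B_{t,k}(i) = vertices j reachable from i by a path
    of length at most k in the 1-skeleton (i itself included)."""
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == k:
            continue
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return seen

def k_ball_simplex(adj, simplex, k):
    """B_{t,k}(sigma): union of the k-balls of sigma's vertices."""
    out = set()
    for v in simplex:
        out |= k_ball(adj, v, k)
    return out

# A closed triangle with all of its faces is an ASC; the bare vertex set
# plus [0, 1, 2] without the edges is not downward-closed.
closed = [[0], [1], [2], [0, 1], [1, 2], [0, 2], [0, 1, 2]]
# 1-skeleton of a small path graph 0-1-2-3 for the k-ball example.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

For instance, `k_ball(adj, 0, 2)` returns {0, 1, 2} on the path graph above, and the ball around the simplex [0, 3] with k = 1 is the union of the endpoints' 1-balls.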

3. PREDICTING THE ARRIVAL OF HIGHER-ORDER SIMPLICES

We consider the prediction of a simplex's arrival in the following setting. Consider a filtered GSC G_{t,p}. At time t, given a d-dimensional simplex σ^(d) = [v_0, . . . , v_d], we predict the formation of a (d+1)-simplex τ^(d+1) = [v_0, . . . , v_d, v] with a new vertex v ∈ B_{t,k}(σ^(d)). Here, we restrict v to be k-reachable from σ^(d). In order to find out which simplices and vertices are most likely to appear in τ^(d+1) at time t + 1, we need to design features for σ^(d) and v.

3.1. FEATURE DESIGN FOR SIMPLEX

We develop a feature design for a simplex associated with the notion of k-balls. The design is organized into two main elements: (i) a face-vector of a sub-complex, and (ii) a scoring function. (i) Face-vector of sub-complex: We first define the face-vector of a fixed GSC. The face-vector is an important topological invariant of the GSC. Definition 4 (Face-vector). A combinatorial statistic of G is the face-vector (or f-vector) f(G) = (f_{−1}, f_0, . . . , f_{d−1}), where f_k = f_k(G) records the number of k-dimensional faces σ^(k) ∈ G. We then define the feature of a simplex σ^(d) at time t as N_t(σ^(d)) = f(G_t(σ^(d))). In words, the feature is compactly represented as the face-vector of the sub-complex G_t(σ^(d)). Our face-vector representation of a node's neighborhood can be considered a higher-order analogue of the Weisfeiler-Lehman (WL) kernel (Shervashidze et al., 2011) on unlabeled graphs, which, for each vertex, iteratively aggregates the vertex degrees of its immediate neighbors to compute a unique vector for the target vertex that captures the structure of its extended neighborhood. (ii) Scoring function: The purpose of this function is to extract the features of v using its proximity to σ^(d). To this end, we begin by describing the affinity between two vertices.
Given two vertices v, v′ ∈ G^(0)_−, we denote by s(v, v′) the weighted sum of all past co-occurrences of v and v′ within a simplex σ^(d), where each co-occurrence is weighted by the dimension d of that simplex. We then devise a scoring function h(·, ·) that assigns an integral score to the possible introduction of a vertex v to a d-simplex σ^(d) = [v_0, . . . , v_d] as

h_t(σ^(d), v) = Σ_{i=0}^{d} s(v_i, v). (1)

Intuitively, a higher score, i.e., stronger past co-occurrence between σ^(d) and v up to time t, indicates a higher likelihood of forming a (d+1)-simplex τ^(d+1) = [σ^(d), v] together at a future time t + 1. We give a higher score to past co-occurrences of vertex pairs in higher-dimensional simplices. Feature vector: Finally, for a given d-simplex σ^(d) at time t and a possible introduction of a new vertex v ∈ B_{t,k}(σ^(d)), we assign a feature vector F_t(σ^(d), v) = (N_t(σ^(d)), h_t(σ^(d), v)). We denote by P_t(σ^(d), F) the set of all such possible (σ^(d), v) pairs whose feature vectors equal F.

Table 1: Notation.
Simplices:
  σ^(d) = [v_0, . . . , v_d]: d-dimensional simplex (d-simplex)
  τ^(d+1) = [v_0, . . . , v_d, v]: (d+1)-simplex for prediction
  G_{t,p} = (G_{t−p}, . . . , G_t): GSCs from the previous p time steps
  G^(d′)_− = {σ^(d) ∈ G | d ≤ d′}: set of simplices of dimension d′ or less
Local simplex:
  B_{t,k}(i) = {j : i ∼_k j, i, j ∈ G^(0)_{t,−}}: k-ball centered at vertex i
  B_{t,k}(σ^(d)) = ∪_{i ∈ Vert(σ^(d))} B_{t,k}(i): k-ball centered at simplex σ^(d)
  G_t(σ^(d)): all simplices from G_t spanned by vertices in B_{t,k}(σ^(d))
Feature of simplex:
  f_k = f_k(G): total number of k-simplices σ^(k) ∈ G
  f(G) = (f_{−1}, f_0, . . . , f_{d−1}): face-vector of G
  N_t(σ^(d)) = f(G_t(σ^(d))): feature of σ^(d)
  s(v, v′): weighted sum of past co-occurrences of v, v′
  h_t(σ^(d), v) = Σ_{i=0}^{d} s(v_i, v): scoring function
  F_t(σ^(d), v) = (N_t(σ^(d)), h_t(σ^(d), v)): feature vector
  P_t(σ^(d), F): set of (σ^(d), v) pairs with corresponding feature F
Furthermore, among the pairs in P_t(σ^(d), F), we denote by P^τ_t(σ^(d), F) the set of pairs with feature vectors equal to F that actually form τ^(d+1) = [σ^(d), v], i.e., σ^(d) appears as a face of τ^(d+1). Note the distinction that not all d-simplices counted in P_t(σ^(d), F) end up being promoted to a (d+1)-simplex in the next time step. Table 1 provides a list of notations.
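The two feature components can be sketched in a few lines, assuming the sub-complex is given as a list of simplices and the history as a list of past simplices (helper names are illustrative; we adopt the common convention f_{−1} = 1 for the empty face, and weight each past co-occurrence by the dimension d of the simplex it occurred in, as described above):

```python
from collections import Counter
from itertools import combinations

def f_vector(complex_, d):
    """f(G) = (f_{-1}, f_0, ..., f_{d-1}): f_k counts the k-dimensional
    simplices; f_{-1} = 1 counts the empty face by convention."""
    dims = Counter(len(s) - 1 for s in complex_)
    return tuple([1] + [dims.get(k, 0) for k in range(d)])

def cooccurrence_scores(past_simplices):
    """s(v, v'): sum over past simplices containing both v and v',
    each co-occurrence weighted by that simplex's dimension d."""
    s = Counter()
    for simplex in past_simplices:
        d = len(simplex) - 1
        for u, w in combinations(sorted(simplex), 2):
            s[(u, w)] += d
    return s

def score(s, sigma, v):
    """h_t(sigma, v) = sum_{i=0}^{d} s(v_i, v)."""
    return sum(s[tuple(sorted((vi, v)))] for vi in sigma)

# Toy history: an edge [0, 1], then triangles [0, 1, 2] and [1, 2, 3].
past = [[0, 1], [0, 1, 2], [1, 2, 3]]
s = cooccurrence_scores(past)
# The feature vector F_t(sigma, v) pairs the f-vector of the sub-complex
# around sigma with the score h_t(sigma, v).
```

For instance, score(s, [0, 1], 2) = s(0, 2) + s(1, 2) = 2 + 4 = 6, reflecting that vertex 2 has already co-occurred with both endpoints inside triangles.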

3.2. PREDICTION MODEL AND KERNEL ESTIMATOR

For the prediction, we define an indicator variable that signals the appearance of a new simplex. Given a d-simplex σ^(d) = [v_0, . . . , v_d] ∈ G_t, the arrival at time t + 1 of a (d+1)-simplex τ^(d+1) = [v_0, . . . , v_d, v] with a new vertex v ∈ B_{t,k}(σ^(d)) is captured by the variable

Y_{t+1}(τ^(d+1)) := 1 if σ^(d) is a face of τ^(d+1), and 0 otherwise.

Prediction model: Our approach is to model this indicator variable. Namely, we assume that the indicator variable follows the distribution

Y_{t+1}(τ^(d+1)) | G_{t,p} ∼ Bernoulli(g(F_t(σ^(d), v))), (4)

where 0 ≤ g(·) ≤ 1 is a function of the feature vector F_t(σ^(d), v). In words, the indicator variable Y_{t+1}(τ^(d+1)) is Bernoulli distributed with success probability g(F_t(σ^(d), v)), conditioned on having seen the last p states of the GSC up to G_t. This model encodes that the appearance probabilities of two simplices σ_i and σ_j are likely to be similar if their feature vectors are similar. Estimator with kernels: We utilize a kernel method to estimate the success probability of the model (Equation 4) based on observed simplices at time t. Let G_t(d) be the set of d-dimensional simplices at time t from G_{t,p}. Also, for brevity, let F′ represent a feature F_{t′}(σ^(d), v_m) for some t′, σ^(d), and v_m subject to v_m ∈ B_{t′,k}(σ^(d)). Let ‖F − F′‖_1 denote the L1-distance between two feature vectors, and define the L1-ball Γ(F, δ) := {F′ : ‖F − F′‖_1 ≤ δ}. We define our kernel function K(·, ·) over two features F and F′ as

K(F, F′) := ( I{F = F′} + β I{‖F − F′‖_1 ≤ δ} ) / ( 1 + β|Γ(F, δ)| ), (5)

where β > 0 is the bandwidth parameter and I is the indicator function. Given the feature F with only integer components, we are interested only in those nearby feature vectors that are either exactly equal to F or lie within an L1-ball of radius δ centered at F.
This explains the choice of our kernel function with discrete indicator variables. Now, we define our estimator. At time T, we fix σ^(d) and v and set F = F_T(σ^(d), v); our estimator of g(F) in (Equation 4) is

ĝ_T(F) := [ Σ_{t′=T−p}^{T} Σ_{σ^(d)_j ∈ G_{t′}(d)} Σ_{v_n ∈ B_{t′,k}(σ^(d)_j)} K(F, F_{t′}(σ^(d)_j, v_n)) · Y_{t′+1}([σ^(d)_j, v_n]) ] / [ Σ_{t′=T−p}^{T} Σ_{σ^(d)_j ∈ G_{t′}(d)} Σ_{v_n ∈ B_{t′,k}(σ^(d)_j)} K(F, F_{t′}(σ^(d)_j, v_n)) ]. (6)

Time complexity of our estimator: As both the face-vectors and the counts P_{t+1} / P^τ_{t+1} (updated in a data cube) are computed simultaneously, they incur the same time overhead. At time t, for a d-simplex σ^(d), it takes O(|V_k| + |E_k|) time to compute a k-ball around σ^(d), where V_k and E_k are the sets of vertices and edges in the k-hop subgraph of a vertex. We must check this k-ball of σ^(d) against the previous p GSCs, which yields a total time of O(p f_d(G_t)(|V_k| + |E_k|) |G^(d)_{t,−}|). Storage complexity of our estimator: For a single simplex, our estimator requires storing (i) a pair of integer counts, namely (P_t(·, ·), P^τ_t(·, ·)), in a data cube, which costs O(1), and (ii) a (d+1)-dimensional face-vector, which takes O(d + 1) storage. As the total number of simplices is f_d(G_t), we arrive at a total storage cost of O(d f_d(G_t)).
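Numerically, the kernel (Equation 5) and the estimator (Equation 6) reduce to a weighted counting scheme. The following toy sketch flattens the triple sum into a list of observed (feature, label) pairs and passes |Γ(F, δ)| in as a precomputed constant; all names and data are illustrative:

```python
def l1(F, G):
    """L1 distance between two integer feature vectors."""
    return sum(abs(a - b) for a, b in zip(F, G))

def kernel(F, G, beta, delta, ball_size):
    """K(F, F'): exact match plus a beta-weighted contribution from
    features within an L1-ball of radius delta; ball_size = |Gamma(F, delta)|
    normalizes the kernel."""
    exact = 1.0 if F == G else 0.0
    near = 1.0 if l1(F, G) <= delta else 0.0
    return (exact + beta * near) / (1.0 + beta * ball_size)

def g_hat(F, observations, beta, delta, ball_size):
    """Kernel-weighted fraction of observed pairs that actually formed
    a (d+1)-simplex (Y = 1), mirroring the ratio in Eq. (6)."""
    num = sum(kernel(F, Fp, beta, delta, ball_size) * y for Fp, y in observations)
    den = sum(kernel(F, Fp, beta, delta, ball_size) for Fp, y in observations)
    return num / den if den > 0 else 0.0

# Each entry: (feature vector F_{t'}(sigma, v_n), observed label Y_{t'+1}).
obs = [((1, 2, 3), 1), ((1, 2, 3), 0), ((1, 2, 4), 1)]
```

With β = 0 the estimate is the exact-match empirical frequency (0.5 on the toy data); β > 0 additionally smooths in the neighboring feature (1, 2, 4).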

4. THEORETICAL PROPERTY OF THE ESTIMATOR

(Figure 2: as T → ∞, the estimator ĝ_T converges to g; consistency and asymptotic normality.)

We show that our estimator has theoretical validity: (i) consistency and (ii) asymptotic normality. (i) Consistency guarantees that ĝ_T achieves zero error as T increases by converging to g. (ii) Asymptotic normality implies that the error ĝ_T − g converges to a normal distribution, which is useful for evaluating the size of the error and can be applied to statistical tests and confidence analysis. Both properties are very important in statistics (Van der Vaart, 2000).

4.1. CONSISTENCY

We study the consistency of our estimator. To analyze estimators on GSCs, we first need the Markov property that the GSC evolution process clearly exhibits. Conditioning on the event S_C (i.e., the event that the state of the Markov chain lies in a set C), we provide the bias-variance decomposition of ĝ_T − g, which is common in the theoretical analysis of estimators. The bias represents the error due to the expressive power of the model, and the variance represents the over-fitting error due to algorithmic uncertainty. By analyzing these terms separately, we can analyze the overall prediction error. We define two functions:

h_T(F) = (T − p)^{−1} Σ_{t=p}^{T−1} |G_t(d)|^{−1} Σ_{j=1}^{|G_t(d)|} |P^τ_{t+1}(σ^(d)_j, F)|, and
d_T(F) = (T − p)^{−1} Σ_{t=p}^{T−1} |G_t(d)|^{−1} Σ_{j=1}^{|G_t(d)|} |P_{t+1}(σ^(d)_j, F)|.

For the sake of brevity, we fix F and denote ĝ_T(F) by ĝ_T; similarly for g, h_T, and d_T. Also, we define the term B_T(F, C) = E[h_T | S_C] / E[d_T | S_C] − g. Then, by Proposition 2 in the supplementary material, we decompose ĝ_T − g as

ĝ_T − g = ([h_T − g d_T] − E[h_T − g d_T | S_C]) / d_T   (=: V_T, variance)
        + B_T(F, C) E[d_T | S_C] / d_T                   (=: B_T, bias)
        + o(1). (7)

To bound the variance term V_T, we assume that our Markov chain X_t exhibits an α-mixing property, which describes the dependence structure of the dynamic process. It is one of the most common and well-used assumptions for time-dependent processes, including dynamic graphs (Sarkar et al., 2014b). Precisely, we present the definition of α-mixing:

Definition 5 (α-mixing). A stochastic process X_t is α-mixing if the coefficient

α(r) = sup_{|t_1 − t_2| ≥ r} { |Pr(A ∩ B) − Pr(A)Pr(B)| : A ∈ Σ(X^−_{t_1}), B ∈ Σ(X^+_{t_2}) }

satisfies α(r) → 0 as r → ∞. Here, Σ(X^−_{t_1}) and Σ(X^+_{t_2}) are the sigma-algebras of past events of the process up to and including t_1 and of future events from t_2 onward. This definition implies that time-dependent processes become close to independent as time passes.
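The split into V_T and B_T is elementary algebra, which we make explicit here using only the definitions of h_T, d_T, and B_T(F, C) above:

```latex
\frac{h_T}{d_T} - g
  = \frac{h_T - g\, d_T}{d_T}
  = \underbrace{\frac{[h_T - g\, d_T] - \mathbb{E}[h_T - g\, d_T \mid S_C]}{d_T}}_{V_T}
  + \frac{\mathbb{E}[h_T \mid S_C] - g\,\mathbb{E}[d_T \mid S_C]}{d_T},
```

where the last term equals B_T(F, C) E[d_T | S_C] / d_T by the definition of B_T(F, C); the remaining o(1) term accounts for the gap between ĝ_T and the unsmoothed ratio h_T / d_T, which vanishes as β → 0.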
That is, events at times t and t + 10000 are close to independent, while events at times t and t + 1 can be correlated. The simplest example is the evolution of stock prices in financial markets: the movement of stock prices today does not correlate with the movement of stock prices 10 years ago. To bound the bias term B_T, we impose a smoothness condition on g. A similar assumption is often used in link prediction (e.g., Assumption 1 in Sarkar et al. (2014a)); ours is a more general and weaker version of the common assumption.

Assumption 1 (Smoothness of g). There exists a function κ : R → R in the Schwartz space (i.e., infinitely differentiable and converging to zero faster than any polynomial as x → ±∞) and a parameter b > 0 such that |g(F) − g(F′)| = O(κ(−‖F − F′‖_1 / b)) as b → 0, for all F, F′.

Using all of these assumptions, we prove the consistency of ĝ_T.

Theorem 1 (Consistency). Suppose that the GSC filtration process is α-mixing, β = o(1), and Assumption 1 holds. Then, for any F and conditional on S_C, our estimator ĝ_T(F) is well-defined with probability tending to 1, and |ĝ_T(F) − g(F)| → 0 in probability as T → ∞.

4.2. ASYMPTOTIC NORMALITY

We show the asymptotic normality of the proposed estimator; that is, we prove that the error of the estimator converges weakly to a normal distribution. This property allows for more detailed investigations, such as correcting for errors in estimators or performing statistical tests. Technically, we develop a distribution approximation result using the Wasserstein distance (Villani, 2008) and Stein's method (Stein et al., 1986) to handle the dependence structure of GSCs. We are interested in approximating a random variable Z_n by a Gaussian random variable, where Z_n is a sum of n mean-centered random variables {A_i}_{i=0}^{n}, and A_i corresponds to a random variable depending on the i-th d-simplex σ_i. Let d_w denote the Wasserstein distance between the underlying distributions of two random variables, and let N be a standard Gaussian variable. We then develop a general result for Z_n; its formal statement is Theorem 3, deferred to the supplementary material due to its complexity.

Proposition 1 (Gaussian approximation for dependent variables; simple version of Theorem 3). Suppose Z_n = Σ_{i=0}^{n} A_i, where {A_i}_{i=0}^{n} is generated by zero-mean random variables X_0, X_1, . . . , X_n satisfying the α-mixing condition, such that A_i = X_i / B_n with B_n = (E[Σ_{i=0}^{n} X_i^2])^{1/2}. Also, suppose that Pr{|X_i| ≤ L} = 1 holds for i = 1, . . . , n for some constant L > 0. Then, for some finite constant C > 0, we have

d_w(Z_n, N) ≤ C ( Σ_{i=1}^{n} E|A_i|^3 + (L^3 / B_n^3) Σ_{i=1}^{n−1} i α(i) ).

This result extends Sunklodas (2007) to Markov chains on GSCs that satisfy the α-mixing condition. This extension makes it possible to study the contributing effect of neighboring d-simplices (represented as a set of weakly dependent random variables) on a central d-simplex. We now provide the asymptotic normality of our estimator.
It shows that our estimator converges to a normal distribution in terms of the Wasserstein distance, which implies weak convergence. To achieve the result, we utilize the decomposed terms V_T and B_T from (7). We then regard V_T as a sum of dependent random variables and apply the result developed in Proposition 1. Let σ_c^2 be the limit of the variance of the numerator of T^{−1/2} V_T as T → ∞. We recall that S_C denotes the event S_t ∈ C, where S_t is the state of the Markov chain at time t.

Theorem 2 (Asymptotic normality). Suppose that Assumption 1 holds, the GSC filtration process is α-mixing, and σ_c > 0. If β = o(T^{−1/2}) and b = o(T^{−1/2}), then, for any F and conditioned on S_C, the following holds as T → ∞:

√T (ĝ_T(F) − g(F)) →_d N(0, σ_c^2 / R(C)^2).

Using this property, we can make detailed inferences based on the distribution of the estimation error. For example, it is possible to construct confidence intervals for predictions and perform statistical tests to rigorously test hypotheses about simplex arrivals.

5. REAL-WORLD DATA EXPERIMENTS

We empirically evaluate the performance of our proposed estimator on real-world dynamic graphs against several baselines. The basic premise of our experiments is to capture the local and higher-order properties surrounding a d-simplex σ^(d) up to time t in order to predict the appearance of a new (d+1)-simplex at time t′ > t that contains σ^(d) as a face. Note that we compare our method to closely related methods that were designed to solve different structure prediction tasks. Datasets: We report results on real-world dynamic graph datasets sourced from Benson et al. (2018). Each dataset contains n nodes, m formed edges, and x timestamped simplices (represented as sets of nodes). There are four datasets, named Enron, EU, Contact, and NDC (detailed in the supplementary material). Compared methods: As naive baselines, we averaged the results of single-edge prediction methods, where a new edge would form between each node in the d-simplex and the vertex to be paired with. Specifically, we compare our estimator with (i) heuristic (Adamic-Adar (AA) (Adamic & Adar, 2001), Jaccard Coefficient (JC) (Salton & McGill, 1986), and preferential attachment (PA) (Mitzenmacher, 2004)), (ii) deep-learning based (Node2vec (NV) (Grover & Leskovec, 2016a) and SEAL (SL) (Zhang & Chen, 2018a)), and (iii) temporal graph network based (TGAT (TT) (da Xu et al., 2020) and TGN (TN) (Rossi et al., 2020)) link prediction methods. We note that Benson et al. (2018), which predicts "simplicial closure," has the closest motivation to our method but a divergent objective; we therefore omit a comparison to their work. For hyperedge prediction (HP), we picked the most representative recent work by Yoon et al. (2020) to compare against, although this method only supports static, non-evolving hypergraphs.

5.1. RESULTS AND DISCUSSION

We averaged the classification accuracy and runtimes of our estimator and the baselines. We performed two sets of experiments on the arrival of a (d+1)-simplex, summarized in Table 2 for d = {1, 2}. We also report the bandwidth β for our estimator, selected by cross-validation. Predicting a 2-simplex (d = 1): We observe that our method is nearly two orders of magnitude faster than the deep-learning based methods (NV and SL) and nearly an order of magnitude faster than the hypergraph prediction method (HP). While the single-edge heuristic methods are relatively faster, their AUC scores are not comparable to ours. We also achieve a nearly 30% improvement (on Enron) over the next best performing prediction method. Predicting a 3-simplex (d = 2): The gap in AUC scores between our method and the baselines is far more pronounced. Our runtimes also improve, owing to the far smaller number of simplices of dimension exceeding 3. Consistent with the slight accuracy drops for higher-dimensional hyperedges observed by Yoon et al. (2020), we note that HP's AUC score remains the same or drops slightly compared to prediction at d = 1.

Advantage of higher dimensional simplices:

We perform additional experiments, increasing d from 1 to 8, and show that handling high-dimensional simplices yields high prediction accuracy. In Figure 3, prediction accuracy generally improves as d increases. Empirical summary: Traditional estimators fail to accurately capture the rich latent information present in higher-order structures (and their sub-structures) that evolve over time. Our estimator succinctly captures this information via the f-vector and via a weighted score for (σ^(d), v) pair formation that depends on the dimension of the simplices in which the pair co-occurred in the past.

6. CONCLUSION

We modeled higher-order interactions as simplices and developed a novel kernel estimator to solve the higher-order structure prediction problem. From a theoretical standpoint, we proved the consistency and asymptotic normality of our estimator. Empirically, we showed that our estimator outperforms hypergraph-based and higher-order link prediction baselines built from both heuristic and deep-learning based pairwise link prediction methods.

Supplementary Materials of Higher-order Link Prediction in Dynamic Graph Simplicial Complexes

A EXAMPLE OF SIMPLEX AND RELATED NOTIONS

Example 1. We begin by computing the k-balls centered at the 1-simplex [9, 10] in G_t and G_{t−1}, respectively. The k-ball at time t for k = 1 (i.e., 1-hop vertices only) centered at [9, 10] is B_{t,1}([9, 10]) = B_{t,1}([9]) ∪ B_{t,1}([10]), the union of the k-balls at the underlying vertices 9 and 10, according to Definition 3:

B_{t,1}([9, 10]) = B_{t,1}([9]) ∪ B_{t,1}([10]) = {9, 13, 8, 5, 6, 10} ∪ {10, 14, 13, 9, 6, 7, 11, 15} = {9, 13, 8, 5, 6, 10, 14, 7, 11, 15}.

Similarly, the k-ball at the previous time step t − 1 for k = 1 (i.e., 1-hop vertices only) centered at [9, 10] is B_{t−1,1}([9, 10]) = B_{t−1,1}([9]) ∪ B_{t−1,1}([10]) = {9,
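The computation in Example 1 is a plain set union; copying the per-vertex balls listed above, it can be checked directly:

```python
# 1-balls at time t around vertices 9 and 10, as listed in Example 1.
B9 = {9, 13, 8, 5, 6, 10}
B10 = {10, 14, 13, 9, 6, 7, 11, 15}

# Definition 3: the ball around the simplex [9, 10] is the union of the
# balls around its vertices.
B_simplex = B9 | B10
print(sorted(B_simplex))  # the ten vertices listed in Example 1
```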

B FURTHER DETAILS OF THE EXPERIMENT B.1 DATASETS

We report results on real-world dynamic graph datasets sourced from Benson et. al. Benson et al. (2018) . Each dataset is a set of timestamped simplices (represented as a set of nodes). In each dataset, let n, m, and x denote the number of nodes, edges formed, and timestamped simplices, respectively. Enron (n = 143, m = 1.8K, x = 5K) and EU (n = 998, m = 29.3K, x = 8K) model email networks where nodes are email addresses and all recipients of an email form a simplex in the network. Contact (n = 327, m = 5.8K, x = 10K) is a proximity graph where nodes represent persons and a simplex is a set of persons in close proximity to each other. NDC (n = 1.1K, m = 6.2K, x = 12K) is a drug network from the National Drug Code directory, where nodes are class labels and a simplex is formed when a set of class labels appear together on a single drug. -Adar Adamic & Adar (2001) and the Jaccard Coefficient Salton & McGill (1986) measure link probability between two nodes based on the closeness of their respective feature vectors. Preferential attachment Mitzenmacher (2004) has received considerable attention as a model of growth of networks as they model future link probability as the product of the current number of neighbors of the two nodes. Motivated by resource allocation in transportation networks (much alike the Optimal Transport (OT) problem), Resource allocation index Zhou et al. (2009b) proposes a node x tries to transmit a unit resource to node y via common neighbors that play the role of transmitters and similarity is measured by the amount of the resource y received from x. Node2vec Grover & Leskovec (2016a) and SEAL Zhang & Chen (2018a) are deep-learning based graph embedding methods that are used in link prediction.

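The four neighborhood heuristics above have one-line definitions; a minimal sketch follows (function names and the toy graph are ours, and Adamic-Adar assumes common neighbors of degree greater than 1):

```python
import math

def adamic_adar(adj, x, y):
    # common neighbors weighted by 1/log(degree)
    return sum(1 / math.log(len(adj[z])) for z in adj[x] & adj[y])

def jaccard(adj, x, y):
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def preferential_attachment(adj, x, y):
    return len(adj[x]) * len(adj[y])

def resource_allocation(adj, x, y):
    # each common neighbor z forwards 1/deg(z) of a unit resource
    return sum(1 / len(adj[z]) for z in adj[x] & adj[y])

# toy graph: square 1-2-4-3 with diagonal 2-3
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(preferential_attachment(adj, 1, 4))  # → 4
```

For the pair (1, 4) the common neighbors are {2, 3}, so the resource allocation index is 1/3 + 1/3 = 2/3.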

Remark 1 (Difference between our setting and simplicial closure). The closest work, Benson et al. (2018), proposed predicting a "simplicial closure" event: at time t there exists a set of nodes that are pairwise edge-connected, and the task is to predict whether at time t + 1 a simplex will arrive that covers all of these nodes. This phenomenon was termed simplicial closure. For example, suppose authors A, B, and C have all co-authored in pairs (i.e., {A, B}, {A, C}, and {B, C}); a simplicial closure event takes place at t + 1 if a simplex {A, B, C} arrives, implying that all three authors co-author a single paper. Our prediction task diverges significantly and aims to solve a different problem. Continuing the example, given only a single co-authorship relationship, say between A and B at time t, we predict whether authors A and B will co-author with a third author C (a ternary co-authorship relationship) on a single paper at time t + 1.
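The distinction can be made concrete on a toy co-authorship graph: simplicial closure starts from open triangles (all three pairwise edges present), while our task starts from a single edge. A small sketch, with hypothetical helper names of our own:

```python
from itertools import combinations

def open_triangles(edges, closed):
    """Triples whose three pairwise edges all exist but that do not yet
    form a closed 2-simplex: the candidates for simplicial closure."""
    E = {frozenset(e) for e in edges}
    nodes = sorted({v for e in edges for v in e})
    return [t for t in combinations(nodes, 3)
            if all(frozenset(p) in E for p in combinations(t, 2))
            and set(t) not in closed]

edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(open_triangles(edges, closed=[]))
# → [('A', 'B', 'C')]
```

In contrast, our estimator takes any single edge, e.g. ("A", "B"), as the seed and predicts its evolution into the 2-simplex [A, B, C].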

C PROOF FOR CONSISTENCY

For preparation, we rewrite the estimator \hat{g}_T. Plugging the definition of our kernel (Equation 5) into the equation of our estimator (Equation 6), along with the definitions of P_t(\cdot,\cdot) and P^\tau_t(\cdot,\cdot) to replace the indicator variables with actual counts, we obtain the following simplification of Equation 6:

\hat{g}_T(F) = \frac{\sum_{t'=T-p}^{T} \sum_{\sigma_j^{(d)} \in G_{t'}(d)} \big( |P^\tau_{t'+1}(\sigma_j^{(d)}, F)| + \beta \sum_{s \in \Gamma(F,\delta)} |P^\tau_{t'+1}(\sigma_j^{(d)}, s)| \big)}{\sum_{t'=T-p}^{T} \sum_{\sigma_j^{(d)} \in G_{t'}(d)} \big( |P_{t'+1}(\sigma_j^{(d)}, F)| + \beta \sum_{s \in \Gamma(F,\delta)} |P_{t'+1}(\sigma_j^{(d)}, s)| \big)}.

When we set \beta = 0, we look for other pairs whose feature equals F and compute the fraction of such nearby pairs that actually form a (d + 1)-simplex at time t' + 1. This fraction is summed across the various d-simplices \sigma_j^{(d)} by varying j, and across the discrete time steps by varying t'. Setting \beta > 0 allows our estimator to smooth over nearby features. It turns out to be simpler to study a proxy estimator \tilde{g}_T, which omits the smoothing of the original one; we will show that it is asymptotically equivalent to \hat{g}_T. Let |G_t(d)| denote the total number of d-simplices in the GSC G_t at time t. Then, for a feature F, we define the proxy estimator as

\tilde{g}_T(F) := \frac{\tilde{h}_T(F)}{\tilde{d}_T(F)},

where the terms are defined as

\tilde{h}_T(F) = \frac{1}{T-p} \sum_{t=p}^{T-1} \sum_{j=1}^{|G_t(d)|} \frac{|P^\tau_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|}, \qquad \tilde{d}_T(F) = \frac{1}{T-p} \sum_{t=p}^{T-1} \sum_{j=1}^{|G_t(d)|} \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|}.

Observe that |P^\tau_{t+1}(\cdot,\cdot)| and |P_{t+1}(\cdot,\cdot)| count, respectively, the actual and possible formations of [\sigma_i^{(d)}, v_m] ((d + 1)-simplices) that have the same feature vector F. Lemma 1 below proves that |\hat{g}_T(F) - \tilde{g}_T(F)| \to 0 as \beta \to 0. First of all, we show the validity of the proxy estimator \tilde{g}_T.

Lemma 1 (Approximation by proxy). We obtain |\hat{g}_T(F) - \tilde{g}_T(F)| = O(\beta) for all F.

Proof of Lemma 1. Recall that \Gamma(F, \delta) denotes the set of features at L_1-distance at most \delta from F. We denote by |\Gamma(F, \delta)| the cardinality of this set.
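Before continuing the proof, note that the proxy estimator just defined is a simple ratio of averaged counts. A toy numerical sketch (list names and values are ours, for illustration only):

```python
def proxy_estimator(actual, possible):
    """g~_T(F) = h~_T(F) / d~_T(F): `actual[t][j]` plays the role of
    |P^tau_{t+1}(sigma_j, F)| (realized formations) and
    `possible[t][j]` that of |P_{t+1}(sigma_j, F)| (possible ones)."""
    h = sum(sum(a) / len(a) for a in actual) / len(actual)      # h~_T(F)
    d = sum(sum(p) / len(p) for p in possible) / len(possible)  # d~_T(F)
    return h / d

# two time steps, two d-simplices each
actual = [[1, 0], [2, 1]]      # realized (d+1)-simplex formations
possible = [[2, 2], [4, 2]]    # possible formations
print(proxy_estimator(actual, possible))  # → 0.4
```

Here h~_T = (0.5 + 1.5)/2 = 1.0 and d~_T = (2.0 + 3.0)/2 = 2.5, giving 0.4.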
\hat{g}_T(F) = \frac{\tilde{h}_T(F) + C_1}{\tilde{d}_T(F) + C_2}, \quad \text{where } C_1 = \beta \sum_{j,t} \sum_{s \in \Gamma(F,\delta)} |P^\tau_{t+1}(\sigma_j^{(d)}, s)|.

Hence,

|\hat{g}_T(F) - \tilde{g}_T(F)| = \left| \frac{\tilde{h}_T(F) + C_1}{\tilde{d}_T(F) + C_2} - \frac{\tilde{h}_T(F)}{\tilde{d}_T(F)} \right| = O(\beta).

In the last step, the second fraction is a positive constant and can thus be ignored in our asymptotic analysis, because both \tilde{h}_T and \tilde{d}_T are bounded.

Next, we prove the convergence of the proxy estimator \tilde{g}_T(F). As the first step, we describe the details of its decomposition in Equation 7. To simplify the notation, we drop F from all estimator notations.

Proposition 2. As written in Equation 7, we obtain \tilde{g}_T - g = V_T + B_T. Furthermore, on the event S_C, there exist stochastic terms q_t for each t such that V_T = \frac{(T-p)^{-1} \sum_t q_t}{\tilde{d}_T}.

Proof of Proposition 2. With the definition of B_T(F, C), we have

\tilde{g}_T - g = \frac{\tilde{h}_T}{\tilde{d}_T} - g = \frac{\tilde{h}_T - g \tilde{d}_T}{\tilde{d}_T} = \frac{[\tilde{h}_T - g \tilde{d}_T] - E[\tilde{h}_T - g \tilde{d}_T \mid S_C]}{\tilde{d}_T} + \frac{E[\tilde{h}_T \mid S_C] - g E[\tilde{d}_T \mid S_C]}{\tilde{d}_T} = \frac{[\tilde{h}_T - g \tilde{d}_T] - E[\tilde{h}_T - g \tilde{d}_T \mid S_C]}{\tilde{d}_T} + \frac{B_T(F, C)\, E[\tilde{d}_T \mid S_C]}{\tilde{d}_T} = V_T + B_T.

We are interested in the asymptotic behavior of the Markov chain as T \to \infty. Let F denote F_T(\sigma_i^{(d)}, v_m). Recall that the terms |P^\tau_{T+1}(\cdot,\cdot)| and |P_{T+1}(\cdot,\cdot)| count all actual and possible formations of [\sigma_i^{(d)}, v_m] ((d+1)-simplices) that result in the same feature vector F at time T + 1. Our Markov chain has a finite state space and hence belongs to a closed communication class with probability approaching 1. We provide statistical consistency conditional on S_C for any communication class C. For a given time step t, we define

\tilde{h}_T(t) := \frac{1}{|G_t(d)|} \sum_{j=1}^{|G_t(d)|} |P^\tau_{t+1}(\sigma_j^{(d)}, F)|, \qquad \tilde{d}_T(t) := \frac{1}{|G_t(d)|} \sum_{j=1}^{|G_t(d)|} |P_{t+1}(\sigma_j^{(d)}, F)|.

Note that \tilde{h}_T = \frac{1}{T-p} \sum_{t=p}^{T-1} \tilde{h}_T(t) and \tilde{d}_T = \frac{1}{T-p} \sum_{t=p}^{T-1} \tilde{d}_T(t).
Let us set

q_t := [\tilde{h}_T(t) - g \tilde{d}_T(t)] - E[\tilde{h}_T(t) - g \tilde{d}_T(t) \mid S_C].

Note that q_t is the numerator of the stochastic term in Equation 7, and is a bounded deterministic function of the state of X_t at a given time step t. For the stochastic term \tilde{d}_T, which appears in the denominator, we show its convergence. The following two lemmas provide the result.

Lemma 2. If the GSC process is \alpha-mixing, then, as T \to \infty, we obtain Var(\tilde{h}_T(F) \mid S_C) \to 0 and Var(\tilde{d}_T(F) \mid S_C) \to 0, for any F.

Proof of Lemma 2. We show that the variance divided by T converges to a non-negative constant. Let U_T := \sum_t q_t / \sqrt{T}, where q_t (as shown in Equation 14) is a bounded deterministic function of the state of X_t at time t. As demonstrated in Sarkar et al. (2014a), we too break our weighted sum U_T across three time intervals: (i) [1, T_C - 1], (ii) [T_C, T_C + M - 1], and (iii) [T_C + M, T], where M is a constant. Now, from Sarkar et al. (2014a), we apply Lemma 5.7 to get that E[Var(U_T \mid E(T_C), S_C) \mid S_C] \to \sigma_c for some \sigma_c \geq 0, and from Lemma 5.8 we have that Var(E[U_T \mid E(T_C), S_C] \mid S_C) = o(1).

Now, since the law of total variance provides

Var(U_T \mid S_C) = E[Var(U_T \mid E(T_C), S_C) \mid S_C] + Var(E[U_T \mid E(T_C), S_C] \mid S_C),

we use the previous results from Lemmas 5.7 and 5.8 in Sarkar et al. (2014a) to get Var(U_T \mid S_C) \to \sigma_c as T \to \infty, for some constant \sigma_c \geq 0. Plugging the definition of q_t into U_T and calculating Var(U_T \mid S_C), it follows that Var(\tilde{h}_T \mid S_C) \to 0 and Var(\tilde{d}_T \mid S_C) \to 0 as T \to \infty. We refer readers to Remark 5.10 in Sarkar et al. (2014a) to see how these results also hold when C is aperiodic.

Lemma 3. If the GSC process is \alpha-mixing, then there exists a deterministic function R(C) of the class C such that \lim_{T\to\infty} E[\tilde{d}_T(F) \mid E(T_C), S_C] = R(C) and \lim_{T\to\infty} E[\tilde{d}_T(F) \mid S_C] = R(C).

Proof of Lemma 3. We know by definition that

E[\tilde{d}_T(F) \mid E(T_C), S_C] = \frac{1}{T-p} \sum_{t=p}^{T-1} \sum_{j=1}^{|G_t(d)|} E\left[ \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \,\middle|\, E(T_C), S_C \right].

This is an average of terms E\big[ |P_{t+1}(\sigma_j^{(d)}, F)| / |G_t(d)| \mid E(T_C), S_C \big] spanning the d-simplices with indices j \in \{1, \dots, |G_t(d)|\} and the discrete time steps t \in \{p, \dots, T-1\}. For ease of notation, let

X_j := \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|},

i.e., X_j denotes the total number of possible (d + 1)-simplices with \sigma_j^{(d)} as a d-face, divided by the total number of d-simplices in G_t. On the R.H.S. of Equation 15, the term inside the summation simplifies as

E[X_j \mid E(T_C), S_C] = \sum_x x \Pr[X_j = x \mid E(T_C), S_C].

We know that both P_{t+1}(\sigma_j^{(d)}, F) and G_t(d) are fully determined given the current state S_t of the Markov chain. Let I_S(Y) denote the indicator of whether "Y holds in state S". We have

\Pr[X_j = x \mid E(T_C), S_C] = \sum_S I_S(X_j = x) \Pr[S_t = S \mid E(T_C), S_C].

As a result, the R.H.S. of Equation 15 becomes

\frac{1}{T} \sum_t \sum_S \Big( \sum_{j,x} x\, I_S(X_j = x) \Big) \Pr[S_t = S \mid E(T_C), S_C].

Let \lambda(S) = \sum_{j,x} x\, I_S(X_j = x), as this term is fully determined by the state S.
Then, Equation 16 can be rewritten as

\sum_S \lambda(S) \frac{\sum_t \Pr[S_t = S \mid E(T_C), S_C]}{T}.

Due to stationarity, the average \sum_t \Pr[S_t = S \mid E(T_C), S_C] / T converges to a constant function of the state S, denoted by R(S). Given that \lambda(S) is bounded and the average term converges to the constant R(S), Equation 17 converges to some constant R(C) > 0, where R(C) is a deterministic function of the communication class C. This proves the first part. A simple application of the tower property of expectation, followed by the dominated convergence theorem, shows that \lim_{T\to\infty} E[\tilde{d}_T(F) \mid S_C] = R(C). This completes the proof.

Then, we are ready to prove the convergence of the variance term.

Proposition 3 (Variance). If the GSC filtration process is \alpha-mixing, then, conditional on S_C, we obtain V_T \to_p 0 as T \to \infty.

Proof of Proposition 3. By Proposition 2, the term V_T is written as (T-p)^{-1} \sum_t q_t / \tilde{d}_T. For the denominator, Lemma 3 shows that E[\tilde{d}_T \mid S_C] \to R(C), where R(C) is a positive deterministic function of the class C. Also, Lemma 2 states that Var(\tilde{d}_T \mid S_C) \to 0 as T \to \infty. Thus, \tilde{d}_T \to_p R(C) > 0 holds, and V_T is asymptotically well defined for the class C. For the numerator, Lemma 2 also shows that \lim_{T\to\infty} Var(\sum_t q_t / T \mid S_C) = 0, and E[q_t \mid S_C] = 0; therefore (1/T) \sum_t q_t \to 0 conditioned on S_C. By the continuous mapping theorem, we obtain the statement.

Next, we discuss the bias term B_T. To this aim, we rewrite the term B_T(F, C) as follows:

B_T(F, C) = \frac{E[\tilde{h}_T(F) \mid S_C] - g(F)\, E[\tilde{d}_T(F) \mid S_C]}{E[\tilde{d}_T(F) \mid S_C]}.

Lemma 4. Suppose that Assumption 1 holds and that the bandwidth parameter b \to 0 as T \to \infty. Then, B_T(F, C) = O(b) = o(1) as T \to \infty.

Proof of Lemma 4.
For t \in [p, T-1], j \in [1, |G_t(d)|], and a fixed feature vector F, the numerator of B_T(F, C) can be expressed as an average of terms of the form

A_t := E\left[ \frac{|P^\tau_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \,\middle|\, S_C \right] - E\left[ \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \,\middle|\, S_C \right] g(F).

The first term in Equation 18 can be rewritten, using the tower property, as

E\left[ E\left[ \frac{|P^\tau_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \,\middle|\, E(T_C), S_C \right] \,\middle|\, S_C \right].

Conditioning on E(T_C) makes |P^\tau_{t+1}(\sigma_j^{(d)}, F)| / |G_t(d)| conditionally independent of S_C if t > T_C. Also, for t \geq T_C, we have

E\left[ \frac{|P^\tau_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \,\middle|\, E(T_C), S_C \right] = \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|}\, g(F_t(\sigma_j^{(d)}, v_n)),

where v_n \in B_{t-1,k}(\sigma_j^{(d)}). Given the result in Equation 19 and the fact that the term |P^\tau_{t+1}(\sigma_j^{(d)}, F)| / |G_t(d)| is bounded, we obtain

E\left[ \frac{|P^\tau_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \,\middle|\, E(T_C), S_C \right] \leq \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|}\, g(F_t(\sigma_j^{(d)}, v_n))\, I[T_C \leq t] + c\, I[T_C > t] \leq \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|}\, g(F_t(\sigma_j^{(d)}, v_n)) + c\, I[T_C > t],

where c > 0 is a constant. Now, the numerator of B_T(F, C) can be upper bounded as

\frac{1}{T} \sum_t A_t \leq \frac{1}{T} \sum_t E\left[ \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \big( g(F_t(\sigma_j^{(d)}, v_n)) - g(F) \big) \,\middle|\, S_C \right] + c \sum_t \Pr[T_C > t] / T.

The second term in Equation 21 vanishes as T \to \infty because it is of order O(E[T_C]/T). Thus, the numerator of B_T(F, C) is an average of terms of the form

E\left[ \frac{|P_{t+1}(\sigma_j^{(d)}, F)|}{|G_t(d)|} \big( g(F_t(\sigma_j^{(d)}, v_n)) - g(F) \big) \,\middle|\, S_C \right].

Our feature-vector counts and simplex neighborhoods are finite because |G_t(d)| is bounded, so the expectation in Equation 22 is a summation of finitely many terms. We set F' = F_t(\sigma_j^{(d)}, v_n) and make use of our smoothness Assumption 1, so that |g(F') - g(F)| = O(\kappa(-\|F' - F\|_1 / b)). We also use Lemma 3 to conclude that the denominator of our bias term converges to a constant R(C). So,

B_T(F, C) = O(\kappa(-\|F' - F\|_1 / b)) = O(b).

Proof of Theorem 1. For the result on \tilde{g}_T, we apply the results of Propositions 3 and 4 to the decomposition in Equation 7 and obtain the statement.
For the result on \hat{g}_T, we additionally combine the result of Lemma 1 and obtain the statement.

We now make use of the Wasserstein metric to measure the distance between distributions. Our estimator, represented as W, can be shown to converge to Z when the Wasserstein distance between the underlying distributions of W and Z converges to zero. We have

d_w(W, Z) = \sup_{h \in BL(\mathbb{R})} |E h(W) - E h(Z)|.

D.1.1 INTRODUCTION TO STEIN'S METHOD FOR NORMAL APPROXIMATION

Stein (Stein et al., 1986) introduced a powerful technique to estimate the rate of convergence of sums of weakly dependent r.v.'s to the standard normal distribution. A remarkable feature of Stein's method is that it can be applied in many circumstances where dependence plays a role; we therefore propose an adaptation of Stein's method to our setting of dynamic GSCs. Given a standard normal r.v. Z, Stein's lemma (stated below) provides a characterization of Z's distribution.

Lemma 5 (Stein's Lemma; Chen et al. (2010)). If W has a standard normal distribution, then E f'(W) = E[W f(W)] for all absolutely continuous functions f : \mathbb{R} \to \mathbb{R} with E|f'(Z)| < \infty. Conversely, if Equation 24 holds for all bounded, continuous, and piecewise continuously differentiable functions f with E|f'(Z)| < \infty, then W has a standard normal distribution.

In order to show that a r.v. W has a distribution close to that of a target distribution Z, one compares the expectations of the two distributions over some collection of bounded functions h : \mathbb{R} \to \mathbb{R}. Here, Stein's lemma (Lemma 5) shows that W =_d Z if E f'(W) - E[W f(W)] = 0 holds. Observe that if the distribution of W is close to that of Z, then evaluating the L.H.S. of Equation 25 at W in place of Z results in a small value.
Putting these difference equations together, one arrives at the following linear differential equation, known as Stein's equation:

f'(w) - w f(w) = h(w) - E h(Z).

The solutions f of Equation 26 with h \in BL(\mathbb{R}) satisfy the following conditions for all y, z \in \mathbb{R}:

\|f\| \leq 2, \quad \|f''\| \leq 2, \quad \|f'\| \leq \sqrt{2/\pi}, \quad |f'(y + z) - f'(y)| \leq D|z|,

where h_0(y) = h(y) - E h(Z), c_1 = \sup_{x \geq 0} \xi(x), c_2 = \sup_{x \geq 0} x(1 - x \xi(x)), and \xi(x) = (1 - \Phi(x))/\varphi(x) (where \Phi(x) is the standard normal distribution function and \varphi(x) = \Phi'(x)). Then D = (c_1 + c_2)\|h_0\|_\infty + 2 is a constant. Additionally, we have a bound on the covariance when the dependent r.v.'s are bounded: if \Pr\{|X| \leq C_1\} = \Pr\{|Y| \leq C_2\} = 1, then

|Cov(X, Y)| \leq 4 C_1 C_2 \alpha(r).

We take an approach similar to Sarkar et al. (2014a) in using the Wasserstein distance to bound the normal approximation. We first define the dependency structure in our GSCs and then propose a notion of \alpha-mixing in the context of GSCs. We obtain a tighter bound than the one proposed in Sarkar et al. (2014a) by instead following an approach proposed by Sunklodas (2007).

D.2 GAUSSIAN APPROXIMATION FOR DEPENDENT VARIABLES WITH GSC

In our model, we assume the r.v. A_i represents a d-simplex \sigma_i^{(d)} in a GSC G. In order to have a notion of \alpha-mixing in our setting, we must first define a distance between two d-simplices \sigma_i and \sigma_j; we drop the (d) superscript for brevity and ease of notation. We define this distance as the Hausdorff distance between simplices,

d_H(\sigma_i, \sigma_j) = \max\Big\{ \sup_{v \in \sigma_i} \inf_{v' \in \sigma_j} d_g(v, v'), \; \sup_{v' \in \sigma_j} \inf_{v \in \sigma_i} d_g(v, v') \Big\},

where d_g(v, v') counts the number of edges in the geodesic connecting vertices v and v' in the 1-skeleton G^{(1)}. In Stein's method, the sum of dependent r.v.'s is studied by breaking the sum Z_n into two sets based on the r in the mixing coefficient \alpha(r). In our setting, given a fixed d-simplex \sigma_i, we study two partial sums pertaining to: 1) all d-simplices that are at most r apart from A_i, and 2) the remaining partial sum after removing the variables pertaining to 1) from Z_n. With this notion of distance between sets of r.v.'s, we modify, with slight deviations, Proposition 4 in Sunklodas (2007) to accommodate our \alpha-mixing in Markov chains based on GSCs. For a sequence of r.v.'s X_1, X_2, \dots satisfying the \alpha-mixing condition, we write

Z_n = \sum_{i=1}^n A_i, \qquad A_i = \frac{X_i}{B_n}, \qquad B_n^2 = E\Big( \sum_{i=1}^n X_i \Big)^2.

We assume B_n > 0.
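The Hausdorff distance above is directly computable from shortest-path distances on the 1-skeleton. A minimal sketch, with helper names and the toy path graph being our own assumptions:

```python
from collections import deque

def geodesic_dists(adj, s):
    # BFS edge-count (geodesic) distances from s in the 1-skeleton
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def hausdorff(adj, sa, sb):
    """d_H between two simplices, viewed as vertex sets, with d_g
    the geodesic distance on the graph (assumes connectivity)."""
    d = {v: geodesic_dists(adj, v) for v in set(sa) | set(sb)}
    forward = max(min(d[v][u] for u in sb) for v in sa)
    backward = max(min(d[u][v] for v in sa) for u in sb)
    return max(forward, backward)

# path graph 1-2-3-4-5
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(hausdorff(adj, [1, 2], [4, 5]))  # → 3
```

On finite simplices the sup/inf in the definition reduce to max/min, as used here.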

T_i^{(m)} denotes the contribution of d-simplices that are further than m away from \sigma_i, and x(\sigma_i, r) denotes the partial sum of r.v.'s representing simplices that are exactly r away from A_i. Therefore, \sum_{r=0}^{m} x(\sigma_i, r) collects all those d-simplices that are at most m away from \sigma_i. We are interested in the contribution of simplices r away from \sigma_i as we vary r from 0 to m. With Proposition 5, we proceed to derive an upper bound on d_w(Z_n, N) (i.e., the Wasserstein distance between Z_n and N).

Proposition 5. Let S(\sigma_i, r) denote the set of d-simplices whose Hausdorff distance from \sigma_i equals r; more formally, S(\sigma_i, r) = \{\sigma_j : d_H(\sigma_i, \sigma_j) = r\}. Additionally, let \bar{X} denote a mean-centered version of a r.v. X. Then,

x(\sigma_i, r) = \sum_{p \in S(\sigma_i, r)} A_p \quad (x(\sigma_i, 0) = A_i), \qquad T_i^{(m)} = Z_n - \sum_{r=0}^{m} x(\sigma_i, r), \; m = 0, 1, \dots \quad (T_i^{(-1)} = Z_n).

Suppose that E Z_n = 0, E Z_n^2 = 1, and E A_i^2 < \infty for all i = 1, \dots, n. Let \varepsilon be a r.v. uniformly distributed in [0, 1] and independent of the other r.v.'s. Let f : \mathbb{R} \to \mathbb{R} be a differentiable function such that \sup_{x \in \mathbb{R}} |f'(x)| < \infty. Then we have

E f'(Z_n) - E Z_n f(Z_n) = E_1 + \dots + E_7,

where

E_1 = -\sum_{i=1}^n \sum_{r \geq 1} E A_i x(\sigma_i, r) \big[ f'(T_i^{(r)} + \varepsilon x(\sigma_i, r)) - f'(T_i^{(r)}) \big], \qquad E_2 = -\sum_{i=1}^n E A_i^2 \big[ f'(T_i^{(0)} + \varepsilon A_i) - f'(T_i^{(0)}) \big],

E_3 = -\sum_{i=1}^n \sum_{r \geq 1} \sum_{q=r+1}^{2r} E\big[ A_i x(\sigma_i, r) \delta_i^{(q)} \big], \qquad E_4 = -\sum_{i=1}^n \sum_{r \geq 1} \sum_{q \geq 2r+1} E\big[ A_i x(\sigma_i, r) \delta_i^{(q)} \big],

E_5 = \sum_{i=1}^n \sum_{r \geq 1} E A_i x(\sigma_i, r) \sum_{q=0}^{r} E \delta_i^{(q)}, \qquad E_6 = -\sum_{i=1}^n \sum_{q \geq 1} E\big[ A_i^2 \delta_i^{(q)} \big], \qquad E_7 = \sum_{i=1}^n E A_i^2 \sum_{q \geq 1} E \delta_i^{(q)},

with \delta_i^{(q)} = f'(T_i^{(q-1)}) - f'(T_i^{(q)}).

Theorem 3. Consider a sequence of r.v.'s X_1, X_2, \dots satisfying the \alpha-mixing condition (Definition 5). Let E X_i = 0 and \Pr\{|X_i| \leq L\} = 1 for i = 1, \dots, n, for some constant L > 0. Then, for every h \in BL(\mathbb{R}),

d_w(Z_n, N) \leq C(S, D) \Big( \sum_{i=1}^n E|A_i|^3 + \frac{n L^3}{B_n^3} \sum_{r=1}^{n-1} r \alpha(r) \Big),

where C(S, D) is a finite constant depending on D and \max_i |S(\sigma_i, r)|.

Proof of Theorem 3. Given a d-simplex \sigma_i and its corresponding r.v.
A_i, recall that S(\sigma_i, r) denotes the set of d-simplices at Hausdorff distance r from \sigma_i. We additionally define S_m as the maximum cardinality of S(\sigma_i, r) over all i for a fixed r, i.e., S_m = \max_i |S(\sigma_i, r)|. In order to upper bound the Wasserstein distance between Z_n and N, we estimate the difference E h(Z_n) - E h(N) using Proposition 5, which shows that this difference is a sum of terms E_1, \dots, E_7. We proceed by bounding each term individually.

Bounding E_1:

|E_1| = \Big| \sum_{i=1}^n \sum_{r \geq 1} \underbrace{E A_i x(\sigma_i, r) \big[ f'(T_i^{(r)} + \varepsilon x(\sigma_i, r)) - f'(T_i^{(r)}) \big]}_{(i)} \Big|.

We can upper bound the bracketed difference in term (i) of Equation 30, using Equation 27, by

\big| f'(T_i^{(r)} + \varepsilon x(\sigma_i, r)) - f'(T_i^{(r)}) \big| \leq D |\varepsilon| |x(\sigma_i, r)| \leq D |x(\sigma_i, r)| \quad (\text{since } |\varepsilon| \leq 1).

Then,

\big| x(\sigma_i, r) \big[ f'(T_i^{(r)} + \varepsilon x(\sigma_i, r)) - f'(T_i^{(r)}) \big] \big| \leq D (x(\sigma_i, r))^2.

Now, we upper bound x(\sigma_i, r) as

|x(\sigma_i, r)| \leq \frac{S_m L}{B_n}.

We know that

Cov\big( A_i, \, x(\sigma_i, r) \big[ f'(T_i^{(r)} + \varepsilon x(\sigma_i, r)) - f'(T_i^{(r)}) \big] \big) = \underbrace{E A_i x(\sigma_i, r) \big[ f'(T_i^{(r)} + \varepsilon x(\sigma_i, r)) - f'(T_i^{(r)}) \big]}_{(ii)} - \underbrace{E A_i}_{=0} \, E x(\sigma_i, r) \big[ f'(T_i^{(r)} + \varepsilon x(\sigma_i, r)) - f'(T_i^{(r)}) \big].

Notice that term (ii) is exactly the summand in Equation 30. We apply the covariance bound (Equation 28) to obtain

(ii) \leq 4 \frac{L}{B_n} \cdot \frac{D S_m^2 L^2}{B_n^2} \alpha(r) = \frac{4 D S_m^2 L^3}{B_n^3} \alpha(r).

Then, we have that

|E_1| \leq \sum_{i=1}^n \sum_{r \geq 1} \frac{4 D S_m^2 L^3}{B_n^3} \alpha(r) \leq \frac{4 D S_m^2 n L^3}{B_n^3} \sum_{r=1}^{n-1} \alpha(r).

Here, the summation \sum_{r=1}^{n-1} appears because there are only n variables whose mutual dependence needs to be measured.

Bounding E_2:

|E_2| = \Big| \sum_{i=1}^n E A_i^2 \underbrace{\big[ f'(T_i^{(0)} + \varepsilon A_i) - f'(T_i^{(0)}) \big]}_{(iii)} \Big|.

Using Equation 27, term (iii) \leq D |A_i|. Then |E_2| can simply be bounded as

|E_2| \leq D \sum_{i=1}^n E |A_i|^3.

Bounding E_3:

|E_3| = \Big| \sum_{i=1}^n \sum_{r \geq 1} \sum_{q=r+1}^{2r} E\big[ A_i x(\sigma_i, r) \delta_i^{(q)} \big] \Big| \leq \sum_{i=1}^n \sum_{r \geq 1} \sum_{q=r+1}^{2r} \underbrace{\big| E \bar{A}_i \, x(\sigma_i, r) \delta_i^{(q)} \big|}_{(iv)} + \sum_{i=1}^n \sum_{r \geq 1} \sum_{q=r+1}^{2r} \underbrace{\big| E A_i x(\sigma_i, r) \big| \, \big| E \delta_i^{(q)} \big|}_{(v)}.

Let us focus on bounding terms (iv) and (v) separately. For term (iv): we can further split it into the product of \underbrace{A_i}_{(a)} and \underbrace{x(\sigma_i, r) \delta_i^{(q)}}_{(b)}. We have previously worked out the bounds for A_i and x(\sigma_i, r), so we now focus on bounding \delta_i^{(q)}:

\delta_i^{(q)} = f'(T_i^{(q-1)}) - f'(T_i^{(q)}) = f'(T_i^{(q)} + x(\sigma_i, q)) - f'(T_i^{(q)}) \leq D |x(\sigma_i, q)| \leq \frac{D S_m L}{B_n} \quad (\text{using Equation 27}).

Applying the covariance bound (Equation 28), we have

(iv) \leq 4 \frac{L}{B_n} \cdot \frac{D S_m^2 L^2}{B_n^2} \alpha(r) = \frac{4 D S_m^2 L^3}{B_n^3} \alpha(r).

For term (v): we have calculated some bounds previously, so we can again split (v) as

\underbrace{\big| E A_i x(\sigma_i, r) \big|}_{(c)} \, \underbrace{\big| E \delta_i^{(q)} \big|}_{(d)} \leq 4 \frac{L}{B_n} \cdot \frac{S_m L}{B_n} \alpha(r) \cdot \frac{D S_m L}{B_n} = \frac{4 D S_m^2 L^3}{B_n^3} \alpha(r).

Finally, we combine these bounds to arrive at our final upper bound:

|E f'(Z_n) - E Z_n f(Z_n)| \leq C(S_m, D) \Big( \sum_{i=1}^n E|A_i|^3 + \frac{n L^3}{B_n^3} \sum_{r=1}^{n-1} r \alpha(r) \Big),

where C(S_m, D) is a constant term depending on S_m and D. This completes our proof.

D.3 DEFERRED PROOF

In this section, we establish our estimator's asymptotic normality result. From Equation 14, we have

q_t := [\tilde{h}_T(t) - g \tilde{d}_T(t)] - E[\tilde{h}_T(t) - g \tilde{d}_T(t) \mid S_C].

Additionally, let us define a variable p_t for convenience as follows:

p_t := [\tilde{h}_T(t) - g \tilde{d}_T(t)] - E[\tilde{h}_T(t) - g \tilde{d}_T(t) \mid E(T_C), S_C].

Note that p_t is conditioned on both E(T_C) and S_C, while q_t is conditioned only on S_C. Keeping these expressions in mind, we begin by showing the weak convergence with p_t.

Lemma 6. Under Assumption 1, for any finite T_C and \sigma_c > 0,

\sum_{t \geq T_C + M} p_t / \sqrt{T} \;\to_d\; N(0, \sigma_c^2) \quad \text{conditioned on } E(T_C) \cap S_C \text{ as } T \to \infty.

Proof of Lemma 6. Consider the normalized r.v. W_T, a sum of weakly dependent r.v.'s:

W_T := \frac{\sum_{t \geq T_C + M} p_t}{\sqrt{Var\big( \sum_{t \geq T_C + M} p_t \mid E(T_C), S_C \big)}},

where each p_t is bounded and mean-centered. We have to show that our upper bound in Theorem 3 converges to zero, so that, by Stein's lemma (Lemma 5), we obtain the normal limit. Note that our p_t corresponds to A_i, and T to n, in Theorem 3. As p_t is a function of S_t, it involves p + 1 GSCs (G_{t-p+1}, \dots, G_{t+1}), and the distance between the i-th and j-th GSCs is defined as dist(i, j) = \max(|i - j| - (p + 1), 0). Therefore, observe that x(\sigma_i, r) involves only two states at distance r from \sigma_i, namely A_{i-r} and A_{i+r}; hence S_m = O(1). Given that \Pr\{|p_t| \leq L\} = 1 for t = 1, \dots, T and some constant L > 0, the bound in Theorem 3 is at most K(L, C_1, c_0)/\sqrt{T}, where K(L, C_1, c_0) is a positive and finite constant depending only on the quantities within the parentheses. The inequality follows from the fact that B_T^2 \geq c_0 T. Thus, we achieve an asymptotic bound of O(T^{-1/2}). This completes our proof.

The weak convergence with p_t implies the weak convergence with q_t, by a simple application of Lemma 7.2 in Sarkar et al. (2014a).

Lemma 7 (Lemma 7.2 in Sarkar et al. (2014a)). Suppose Lemma 6 holds.
Then, under Assumption 1 and assuming \sigma_c > 0, conditioned on S_C,

\sum_t q_t / \sqrt{T} \;\to_d\; N(0, \sigma_c^2) \quad \text{as } T \to \infty.

We now prove the weak convergence of our estimator.

Proof of Theorem 2. By the decomposition of \tilde{g}_T - g in Theorem 1, we have

\sqrt{T} (\hat{g}_T - g) = \sqrt{T} (\hat{g}_T - \tilde{g}_T) + \sqrt{T} (\tilde{g}_T - g) = O(\sqrt{T} \beta) + \frac{T^{-1/2} \sum_t q_t}{\tilde{d}_T} + \frac{T^{1/2} B_T \, E[\tilde{d}_T \mid S_C]}{\tilde{d}_T}.

We can now assess the convergence of each term; here W_T denotes a random variable such that W_T \to_d N(0, \sigma_c^2 / R(C)^2). With the settings \beta = o(T^{-1/2}) and b = o(T^{-1/2}), we obtain the statement.



We handle the incremental model (edge insertions only), as opposed to the harder fully dynamic model (edge insertions and deletions allowed), for which most previous methods also cannot provide theoretical guarantees. A topological invariant is a property that is preserved by homeomorphisms. All simplices are placed on a line, each in increasing order of its dimension. In practice, the total number of d-simplices is much smaller than the maximum possible number of cliques on n vertices.



Figure 1: [Left] Given a 4-node graph at time t, the 2-simplex [A, B, C] also contains the 1-simplices [A, B], [B, C], and [A, C]. At time t' > t, the 2-simplex evolves (by connecting with D) to a 3-simplex [A, B, C, D], which additionally contains the 1-simplices [A, D], [B, D], and [C, D], along with the 2-simplices [A, B, D], [A, C, D], and [B, C, D]. [Middle] Simplex setting of Benson et al. (2018): the method predicts [A, B, C] (closed triangle) at time t' > t from an open triangle with links/edges [A, B], [B, C], and [A, C] at time t. [Right] A hypergraph represents a triple [A, B, C] as a hyperedge, without any of its subsets; it cannot distinguish between [[A, B], [B, C], [A, C]] and [A, B, C].
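The containment relations in the caption can be checked mechanically: a simplex on d + 1 vertices contains every (k+1)-subset of its vertices as a k-dimensional face. A one-function sketch (the `faces` helper is our own):

```python
from itertools import combinations

def faces(simplex, dim):
    """All dim-dimensional faces of a simplex; a face on k+1
    vertices has dimension k, as in Figure 1 [Left]."""
    return [list(f) for f in combinations(simplex, dim + 1)]

print(faces(["A", "B", "C", "D"], 2))
# → [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'C', 'D'], ['B', 'C', 'D']]
```

This reproduces the four 2-simplices contained in the 3-simplex [A, B, C, D].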

(Middle)), all faces [A, B], [B, C], and [A, C] (an open triangle) need to be present in order to predict a closed triangle [A, B, C]. In contrast, in our approach, a single face/edge such as [A, B], [B, C], or [A, C] suffices to predict its evolution to [A, B, C]. Figure 1 (Left) illustrates a further example of our proposal: predicting a 3-simplex [A, B, C, D] given only one of its faces, [A, B, C].

Figure 2: Example of evolution of GSC G. The yellow triangles are 2-simplices. In the k-ball around 1-simplex [9, 10] (in red), a 1-simplex [10,7] is added at time t.

number of simplices (with dimension at most d) from the previous p time steps, by intersecting against them to get counts for the face vector and data cube. Each intersection test takes O(|V_k|) time. Recall that f_d(G_t) denotes the total number of d-simplices in G_t, and that V_k and E_k are the sets of vertices and edges in the k-hop subgraph of a vertex. We must check this k-ball of \sigma^{(d)} against p|G^{(d)}_{t-}| simplices (with dimension at most d) from the previous p time steps. So, the entire computation across T time chunks has a time complexity of O(T p f_d

\sigma^{(d)}_i in our GSC, which is dependent on other d-simplices whose neighborhoods largely overlap with that of \sigma^{(d)}_i

(n = 143, m = 1.8K, x = 5K), EU (n = 998, m = 29.3K, x = 8K), Contact (n = 327, m = 5.8K, x = 10K), and NDC (National Drug Code) (n = 1.1K, m = 6.2K, x = 12K).

Experimental setup: We first ordered the timestamped simplices by arrival time and grouped them into T time slices. For most of our experiments, T was set to 20, except for d = 2, where T was set to 6 and 12 for EU and NDC, respectively. Then, we randomly sampled a set of d-simplices from the time slices in the range [1, T - 1]. Those d-simplices paired with a vertex that successfully formed a face of a (d + 1)-simplex in the T-th time slice were classified as positive samples, while the rest were deemed negative samples. We picked an equal number of positive and negative samples for evaluation. For K-fold cross-validation over \beta, we swapped the T-th time slice with one of the K slices preceding it for each fold; K was set to 3. All experiments were repeated 10 times, and average AUC scores and runtimes are reported.
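The grouping step of the setup above can be sketched as follows; the `time_slices` helper, the `events` variable, and the equal-size slicing policy are our own illustrative assumptions (the paper only specifies ordering by arrival time and grouping into T slices):

```python
def time_slices(events, T):
    """Group timestamped simplices into T contiguous slices by
    arrival order; `events` is a list of (timestamp, simplex) pairs."""
    ordered = sorted(events, key=lambda e: e[0])
    n = len(ordered)
    slices = [[] for _ in range(T)]
    for i, (_, simplex) in enumerate(ordered):
        slices[min(i * T // n, T - 1)].append(simplex)
    return slices

events = [(3, {1, 2}), (1, {2, 3}), (2, {1, 3}),
          (6, {1, 2, 3}), (4, {3, 4}), (5, {2, 4})]
print([len(s) for s in time_slices(events, 3)])  # → [2, 2, 2]
```

Positive samples are then those d-simplices from slices [1, T - 1] that gain a vertex to form a (d + 1)-simplex in the final slice.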

Figure 3: AUC score of our estimator for predicting the future formation of a d-simplex, given a (d - 1)-simplex.

[9, 10], [9, 6], [9, 5], [9, 8], [9, 13], [10, 6], [10, 7], [10, 11], [10, 15], [10, 14], [10, 13], [9, 6, 10], [10, 14, 15]}. Now, we compute the compressed f-vector notation of G_t([9, 10]) to get f(G_t([9, 10])) = (1, 10, 11, 2). Finally, the neighborhood is N_t([9, 10]) = (1, 10, 11, 2).
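The compressed f-vector is just a per-dimension count of simplices. A minimal sketch; the leading 1 counting the empty face is an assumed convention, chosen to match the (1, 10, 11, 2) value above:

```python
from collections import Counter

def f_vector(simplices):
    """Compressed f-vector: (1, #vertices, #edges, #triangles, ...);
    vertices are collected from the higher simplices."""
    verts = {v for s in simplices for v in s}
    counts = Counter(len(s) - 1 for s in simplices if len(s) > 1)
    counts[0] = len(verts)
    top = max(counts)
    return (1,) + tuple(counts[d] for d in range(top + 1))

# the 13 simplices of G_t([9, 10]) listed above
simplices = [[9, 10], [9, 6], [9, 5], [9, 8], [9, 13], [10, 6], [10, 7],
             [10, 11], [10, 15], [10, 14], [10, 13], [9, 6, 10], [10, 14, 15]]
print(f_vector(simplices))  # → (1, 10, 11, 2)
```

Ten distinct vertices, eleven 1-simplices, and two 2-simplices recover the stated neighborhood vector.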

and C_2 = \beta \sum_{j,t} \sum_{s \in \Gamma(F,\delta)} |P_{t+1}(\sigma_j^{(d)}, s)|. Due to the finiteness of the features in P^\tau_{t+1}(\cdot,\cdot) and P_{t+1}(\cdot,\cdot), we have that C_1 = C_2 = O(\beta); both C_1 and C_2 are non-negative. So,

The last equality follows from the property of \kappa in the Schwartz space. Now, since B_T(F, C) = O(b) and b \to 0 as T \to \infty, we have that B_T(F, C) = o(1). This completes the proof. Next, we prove the convergence of the bias term.

Proposition 4 (Bias). If Assumption 1 holds, then, conditional on S_C, B_T \to_p 0 as T \to \infty.

Proof of Proposition 4. Proposition 2 shows that B_T(F) = B_T(F, C) / \tilde{d}_T(F). By Lemmas 2 and 3, we obtain \tilde{d}_T \to R(C) > 0 as T \to \infty, as shown similarly in the proof of Proposition 3. Further, Lemma 4 states that B_T(F, C) = O(b) = o(1). Combining the results, we obtain the statement. Now, we can prove the consistency result (Theorem 1).

D PROOF FOR ASYMPTOTIC NORMALITY

D.1 INTRODUCTION TO WASSERSTEIN DISTANCE AND APPROXIMATION TECHNIQUE

We denote by BL(\mathbb{R}) the space of bounded functions h that are 1-Lipschitz; more formally, \|h\|_\infty = \sup_{x \in \mathbb{R}} |h(x)| < \infty and Lip(h) = 1, where

Lip(h) = \sup_{x \neq y} \frac{|h(x) - h(y)|}{|x - y|}.

So, h \in BL(\mathbb{R}).

Therefore, term (b) (in term (iv)) is upper bounded by \frac{D S_m^2 L^2}{B_n^2}.

Combining the inequalities for terms (iv) and (v), we obtain a bound on |E_3|. Using our previous bounds, the remaining terms, i.e., |E_5|, |E_6|, and |E_7|, are bounded as follows:

Let n = |G t (d)| denote the total number of d-simplices in GSC G t at time t. Recall from Equation

conditioned on E(T_C) \cap S_C. Recall that, by Lemma 5.7 in Sarkar et al. (2014a), we have

Var\Big( \sum_{t \geq T_C + M} p_t \,\Big|\, E(T_C), S_C \Big) / T \to \sigma_c^2.

Now, we show that, conditioned on the event E(T_C) \cap S_C, our bound in Theorem 3 has a convergence rate of O(T^{-1/2}).

We additionally impose that \sum_{r=1}^{\infty} r \alpha(r) < \infty, because we use a decaying function for \alpha(\cdot), and that B_T^2 \geq c_0 T for a positive constant c_0. We choose \alpha(r) \leq C_1 \frac{1}{r^2 (\log r)^p} for r \geq 2 and fixed p > 1, where 0 < C_1 < \infty is a constant. Then,

Since Lemma 3 shows E[\tilde{d}_T \mid S_C] \to R(C) and Lemma 2 shows Var(\tilde{d}_T \mid S_C) \to 0, we obtain \tilde{d}_T \to_p R(C). By Lemma 7, T^{-1/2} \sum_t q_t converges to N(0, \sigma_c^2). Further, Lemma 4 proves B_T = O(b). Combining these results with Slutsky's lemma, we get:

\sqrt{T} (\hat{g}_T - g) = O(\sqrt{T} \beta) + W_T + O_P(\sqrt{T} b),

Notation table

It is well-known that there exists a set of irreducible closed communication classes C in the state space S. We denote the time of entering class C by T C and the event as E(T C ). Let S C denote the event S t ∈ C, where S t is the state of the Markov chain at time t. Then, E(T C ) ∩ S C is the event that the chain enters class C at time T C and remains in that communication class indefinitely.

AUC scores and runtimes for baselines versus our method's estimator for d = 1 and d = 2.

