EXACT REPRESENTATION OF SPARSE NETWORKS WITH SYMMETRIC NONNEGATIVE EMBEDDINGS

Anonymous

Abstract

Many models for undirected graphs are based on factorizing the graph's adjacency matrix; these models find a vector representation of each node such that the predicted probability of a link between two nodes increases with the similarity (dot product) of their associated vectors. Recent work has shown that these models are unable to capture key structures in real-world graphs, particularly heterophilous structures, wherein links occur between dissimilar nodes. In contrast, a factorization with two vectors per node, based on logistic principal components analysis (LPCA), has been proven not only to represent such structures, but also to provide exact low-rank factorization of any graph with bounded max degree. However, this bound has limited applicability to real-world networks, which often have power law degree distributions with high max degree. Further, the LPCA model lacks interpretability since its asymmetric factorization does not reflect the undirectedness of the graph. We address the above issues in two ways. First, we prove a new bound for the LPCA model in terms of arboricity rather than max degree; this greatly increases the bound's applicability to many sparse real-world networks. Second, we propose an alternative graph model whose factorization is symmetric and nonnegative, which allows for link predictions to be interpreted in terms of node clusters. We show that the bounds for exact representation in the LPCA model extend to our new model. On the empirical side, our model is optimized effectively on real-world graphs with gradient descent on a cross-entropy loss. We demonstrate its effectiveness on a variety of foundational tasks, such as community detection and link prediction.

1. INTRODUCTION

Graphs naturally arise in data from a variety of fields including sociology (Mason & Verwoerd, 2007), biology (Scott, 1988), and computer networking (Bonato, 2004). A key underlying task in machine learning for graph data is forming models of graphs which can predict edges between nodes, form useful representations of nodes, and reveal interpretable structure in the graph, such as detecting clusters of nodes. Many graph models fall under the framework of edge-independent graph generative models, which can output the probabilities of edges existing between any pair of nodes. The parameters of such models can be trained iteratively on the network, or on some known fraction of the network as in the link prediction task, i.e., by minimizing a predictive loss. To choose among these models, one must consider two criteria: 1) whether the model can express structures of interest in the graph, and 2) whether the model expresses these structures in an interpretable way.

Expressiveness of low-dimensional embeddings. As real-world graphs are high-dimensional objects, graph models generally compress information about the graph. Such models are exemplified by the family of dot product models, which associate each node with a real-valued "embedding" vector; the predicted probability of a link between two nodes increases with the similarity of their embedding vectors. These models can alternatively be seen as factorizing the graph's adjacency matrix to approximate it with a low-rank matrix. Recent work of Seshadhri et al. (2020) has shown that dot product models are limited in their ability to model common structures in real-world graphs, such as triangles incident only on low-degree nodes. In response, Chanpuriya et al.
(2020) showed that with the logistic principal components analysis (LPCA) model, which has two embeddings per node (i.e., using the dot product of the 'left' embedding of one node and the 'right' embedding of another), not only can such structures be represented, but further, any graph can be exactly represented with embedding vectors whose lengths are linear in the max degree of the graph. There are two keys to this result. First is the presence of a nonlinear linking function in the LPCA model; since adjacency matrices are generally not low-rank, exact low-rank factorization is generally impossible without a linking function. Second is that having two embeddings rather than one allows for expression of non-positive semidefinite (PSD) matrices. As discussed in Peysakhovich & Bottou (2021), the fact that single-embedding models can only represent PSD matrices precludes representation of 'heterophilous' structures in graphs; heterophilous structures are those wherein dissimilar nodes are linked, in contrast to more intuitive 'homophilous' linking between similar nodes.

Interpretability and node clustering. Beyond being able to capture a given network accurately, it is often desirable for a graph model to form interpretable representations of nodes and to produce edge probabilities in an interpretable fashion. Dot product models can achieve this by restricting the node embeddings to be nonnegative. Nonnegative factorization has long been used to decompose data into parts (Donoho & Stodden, 2003). In the context of graphs, this entails decomposing the set of nodes of the network into clusters or communities. In particular, each entry of the nonnegative embedding vector of a node represents the intensity with which the node participates in a community. This allows the edge probabilities output by dot product models to be interpretable in terms of coparticipation in communities.
Depending on the model, these vectors may have restrictions such as a sum-to-one requirement, meaning the node is assigned a categorical distribution over communities. The least restrictive and most expressive case is that of soft assignments to overlapping communities, where the entries can vary totally independently. In such models, which include the BIGCLAM model of Yang & Leskovec (2013), the output of the dot product may be mapped through a nonlinear link function (as in LPCA) to produce a probability for each edge, i.e., to ensure the values lie in [0, 1].

Heterophily: Motivating example. To demonstrate how heterophily can manifest in networks, as well as how models which assume homophily can fail to represent such networks, we provide a simple synthetic example. Suppose we have a graph of matches between users of a mostly heterosexual dating app, and the users each come from one of ten cities. Members from the same city are likely to match with each other; this typifies homophily, wherein links occur between similar nodes. Furthermore, users of the same gender are unlikely to match with each other; this typifies heterophily. (Note that a mostly homosexual dating app would not exhibit heterophily in this sense.) Figure 1 shows an instantiation of such an adjacency matrix with 1000 nodes, which are randomly assigned to man or woman and to one of the ten cities. We recreate this network with our proposed embedding model and with BIGCLAM, which explicitly assumes homophily. (It is far from alone in this assumption; see Li et al. (2018) for a recent example, along with more examples and further discussion in Section 3.) We also compare with the SVD of the adjacency matrix, which outputs the best (lowest Frobenius error) low-rank approximation that is possible without a nonlinear linking function. Since SVD lacks nonnegativity constraints on the factors, we do not expect interpretability.
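The construction of this synthetic network can be sketched in a few lines of NumPy. The exact match probabilities behind Figure 1 are not specified here, so the values below (and the function name) are illustrative assumptions of ours:

```python
import numpy as np

def synthetic_dating_graph(n=1000, n_cities=10, p_same_city=0.5, p_diff_city=0.02, seed=0):
    """Random graph with homophily in city and heterophily in gender:
    edges occur only across genders, and mostly within the same city."""
    rng = np.random.default_rng(seed)
    city = rng.integers(0, n_cities, size=n)
    man = rng.integers(0, 2, size=n)
    same_city = city[:, None] == city[None, :]
    diff_gender = man[:, None] != man[None, :]
    prob = np.where(same_city, p_same_city, p_diff_city) * diff_gender
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # sample each pair once
    A = (upper | upper.T).astype(float)               # symmetric, no self-loops
    return A, city, man

A, city, man = synthetic_dating_graph(n=200)
assert np.allclose(A, A.T) and np.all(np.diag(A) == 0)
assert A[np.equal.outer(man, man)].sum() == 0   # heterophily: no same-gender edges
```

The expected adjacency matrix of this construction has one block per (city, gender) pair, which is what makes it a clean stress test for homophily-only models.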
In Figure 1, we show how BIGCLAM captures only the ten communities based on city, i.e., only the homophilous structure, and fails to capture the heterophilous distinction between men and women. We also plot the error of the reconstructions as the embedding length increases. There are 10 × 2 = 20 different kinds of nodes, meaning the expected adjacency matrix is rank-20, and our model maintains the lowest error up to this embedding length; by contrast, BIGCLAM is unable to decrease error after capturing city information with length-10 embeddings. In Figure 3, we visualize the features generated by the three methods, i.e., the factors returned by each factorization. Our model's factors capture the relevant latent structure in an interpretable way. By contrast, SVD's factors are harder to interpret, and BIGCLAM does not represent the heterophilous structure.

Summary of main contributions

The key contributions of this work are as follows:

• We prove that the LPCA model admits exact low-rank factorizations of graphs with bounded arboricity, which is the minimum number of forests into which a graph's edges can be partitioned. By the Nash-Williams theorem, arboricity is a measure of a graph's density in that, letting S denote an induced subgraph and n_S and m_S denote the number of nodes and edges in S, arboricity is the maximum over all subgraphs S of ⌈m_S / (n_S - 1)⌉. Our result is more applicable to real-world graphs than the prior one for graphs with bounded max degree, since sparsity is a common feature of real networks, whereas low max degree is not.

• We introduce a graph model which is both highly expressive and interpretable. Our model incorporates two embeddings per node and a nonlinear linking function, and hence is able to express both heterophily and overlapping communities. At the same time, our model is based on symmetric nonnegative matrix factorization, so it outputs link probabilities which are interpretable in terms of the communities it detects.

• We show how any graph with a low-rank factorization in the LPCA model also admits a low-rank factorization in our community-based model. This means that the guarantees on low-rank representation for bounded max degree and arboricity also apply to our model.

• In experiments, we show that our method is competitive with and often outperforms other comparable models on real-world graphs in terms of representing the network, doing interpretable link prediction, and detecting communities that align with ground-truth.

[Figure 2 caption] Left: Reconstruction of the synthetic graph of Figure 1 with SVD, BIGCLAM, and our model, using 12 communities or singular vectors. Note the lack of the small diagonal structure in BIGCLAM's reconstruction; this corresponds to its inability to capture the heterophilous interaction between men and women. Right: Frobenius error when reconstructing the motivating synthetic graph of Figure 1 with SVD, BIGCLAM, and our model, as the embedding length is varied. The error is normalized by the sum of the true adjacency matrix (i.e., twice the number of edges).

[Figure 3 caption] Factors recovered from the graph of Figure 1 with the three models, using 12 communities or singular vectors. The top/bottom rows represent the positive/negative eigenvalues corresponding to homophilous/heterophilous communities (note that BIGCLAM does not include the latter). The homophilous factors from BIGCLAM and our model reflect the 10 cities, and the heterophilous factor from our model reflects men and women. The factors from SVD are harder to interpret. Note that the order of the communities in the factors is arbitrary.

2. COMMUNITY-BASED GRAPH FACTORIZATION MODEL

Consider the set of undirected, unweighted graphs on n nodes, i.e., the set of graphs with symmetric adjacency matrices in {0, 1}^{n×n}. We propose an edge-independent generative model for such graphs. Given nonnegative parameter matrices B ∈ R_+^{n×k_B} and C ∈ R_+^{n×k_C}, we set the probability of an edge existing between nodes i and j to be the (i, j)-th entry of the matrix

Ã := σ(BB⊤ - CC⊤),    (1)

where σ is the logistic function. Here k_B and k_C are the numbers of homophilous and heterophilous clusters, respectively. Intuitively, if b_i ∈ R_+^{k_B} is the i-th row of matrix B, then b_i is the affinity of node i to each of the k_B homophilous communities. Similarly, c_i ∈ R_+^{k_C} is the affinity of node i to the k_C heterophilous communities. Equivalently, for each pair of nodes i and j, Ã_{i,j} := σ(b_i b_j⊤ - c_i c_j⊤). We will soon discuss the precise interpretation of this model, but the idea is roughly similar to the attract-repel framework of Peysakhovich & Bottou (2021). When nodes i and j have similar 'attractive' b embeddings, i.e., when b_i b_j⊤ is high, the likelihood of an edge between them increases, hence why the B factor is homophilous. By contrast, the C factor is 'repulsive'/heterophilous since, when c_i c_j⊤ is high, the likelihood of an edge between i and j decreases.
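In code, the model's forward computation is a few lines (a NumPy sketch with illustrative dimensions; the function name is ours):

```python
import numpy as np

def edge_probabilities(B, C):
    """Forward pass of the model: A~ = sigma(B B^T - C C^T)."""
    logits = B @ B.T - C @ C.T
    return 1.0 / (1.0 + np.exp(-logits))   # logistic function sigma

rng = np.random.default_rng(0)
B = rng.random((5, 3))   # nonnegative affinities to 3 homophilous communities
C = rng.random((5, 2))   # nonnegative affinities to 2 heterophilous communities
A_tilde = edge_probabilities(B, C)
assert A_tilde.shape == (5, 5)
assert np.allclose(A_tilde, A_tilde.T)        # symmetric by construction
assert ((A_tilde > 0) & (A_tilde < 1)).all()  # entries are valid probabilities
```

Symmetry of the output follows directly from the factorization, in contrast to the asymmetric LPCA factorization discussed later.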

Alternate expression

We note that the model above can also be expressed in a form which normalizes cluster assignments and is more compact, in that it combines the homophilous and heterophilous cluster assignments. Instead of B and C, this form uses a matrix V ∈ [0, 1]^{n×k} and a diagonal matrix W ∈ R^{k×k}, where k = k_B + k_C is the total number of clusters. In particular, let m_B and m_C be the vectors containing the maximums of each column of B and C. By setting

V = [B diag(m_B)^{-1} ; C diag(m_C)^{-1}],    W = diag([+m_B² ; -m_C²]),

where the squares are taken entrywise, the constraint on V is satisfied. Further, V W V⊤ = BB⊤ - CC⊤, so Ã := σ(BB⊤ - CC⊤) = σ(V W V⊤). Here, if v_i ∈ [0, 1]^k is the i-th row of matrix V, then v_i is the soft (normalized) assignment of node i to the k communities. The diagonal entries of W represent the strength of the homophily (if positive) or heterophily (if negative) of the communities. For each entry, Ã_{i,j} = σ(v_i W v_j⊤). We use these two forms interchangeably throughout this work.

Interpretation. The edge probabilities output by this model have an intuitive interpretation. Recall that there are bijections between probability p ∈ [0, 1], odds o = p/(1-p) ∈ [0, ∞), and logit ℓ = log(o) ∈ (-∞, +∞). The logit of the link probability between nodes i and j is v_i W v_j⊤, which is a summation of terms v_{ic} v_{jc} W_{cc} over all communities c ∈ [k]. If the nodes both fully participate in community c, that is, v_{ic} = v_{jc} = 1, then the edge logit is changed by W_{cc} starting from a baseline of 0, or equivalently, the odds of an edge is multiplied by exp(W_{cc}) starting from a baseline odds of 1; if either of the nodes participates only partially in community c, then the change in logit and odds is accordingly prorated.
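The conversion between the two forms can be checked numerically (a sketch assuming every column of B and C has a nonzero maximum, so the normalization is well defined):

```python
import numpy as np

def to_normalized_form(B, C):
    """Convert (B, C) into (V, W) with V in [0,1]^{n x k} and W diagonal,
    so that V @ W @ V.T == B @ B.T - C @ C.T."""
    mB = B.max(axis=0)            # per-column maxima of B
    mC = C.max(axis=0)            # per-column maxima of C
    V = np.hstack([B / mB, C / mC])
    W = np.diag(np.concatenate([mB**2, -(mC**2)]))
    return V, W

rng = np.random.default_rng(1)
B = rng.random((6, 3)) + 0.1      # strictly positive, so column maxima are nonzero
C = rng.random((6, 2)) + 0.1
V, W = to_normalized_form(B, C)
assert V.min() >= 0 and V.max() <= 1
assert np.allclose(V @ W @ V.T, B @ B.T - C @ C.T)
```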
Homophily and heterophily also have a clear interpretation in this model: homophilous communities, which are expressed in B, are those with W_{cc} > 0, where two nodes both participating in the community increases the odds of a link, whereas communities with W_{cc} < 0, which are expressed in C, are heterophilous, and coparticipation decreases the odds of a link.
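A small numerical example of this odds interpretation, with illustrative values of W:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# One homophilous community (W_11 = +2) and one heterophilous one (W_22 = -3).
W = np.diag([2.0, -3.0])

v_i = np.array([1.0, 0.0])   # node i fully in community 1
v_j = np.array([1.0, 0.0])   # node j fully in community 1
logit = v_i @ W @ v_j
assert logit == 2.0          # odds multiplied by exp(2) from a baseline odds of 1

v_j = np.array([0.5, 0.0])   # partial participation prorates the logit
assert v_i @ W @ v_j == 1.0

v_i = v_j = np.array([0.0, 1.0])     # coparticipation in the heterophilous community
assert sigmoid(v_i @ W @ v_j) < 0.5  # pushes link probability below the 0.5 baseline
```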

3. RELATED WORK

Community detection via interpretable factorizations

There is extensive prior work on the community detection / node clustering problem (Schaeffer, 2007; Aggarwal & Wang, 2010; Nascimento & De Carvalho, 2011), perhaps the most well-known being the normalized cuts algorithm of Shi & Malik (2000), which produces a clustering based on the entrywise signs of an eigenvector of the graph Laplacian matrix. However, the clustering algorithms which are most relevant to our work are those based on nonnegative matrix factorization (NMF) (Lee & Seung, 1999; Berry et al., 2007; Wang & Zhang, 2012; Gillis, 2020), many of which can be seen as integrating nonnegativity constraints into the broader, well-studied random dot product graph (RDPG) model (Young & Scheinerman, 2007; Scheinerman & Tucker, 2010; Athreya et al., 2017; Gallagher et al., 2021; Marenco et al., 2022). One such algorithm is that of Yu et al. (2005), which approximately factors a graph's adjacency matrix A ∈ {0, 1}^{n×n} into two nonnegative matrices H and Λ, where H ∈ R_+^{n×k} is left-stochastic (i.e., each of its columns sums to 1) and Λ ∈ R_+^{k×k} is diagonal, such that HΛH⊤ ≈ A. Here H represents a soft clustering of the n nodes into k clusters, while the diagonal entries of Λ represent the prevalence of edges within clusters. Note the similarity of the factorization to our model, save for the lack of a nonlinearity. Other NMF approaches include those of Ding et al. (2008), Yang et al. (2012), Kuang et al. (2012), and Kuang et al. (2015) (SYMNMF).

Modeling heterophily. Much of the existing work on graph models has an underlying assumption of network homophily (Johnson et al., 2010; Noldus & Van Mieghem, 2015).
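As a concrete reference point for this family of methods, symmetric NMF of the adjacency matrix (A ≈ HH⊤ with H ≥ 0) can be sketched with simple projected gradient descent. The published SYMNMF algorithms use more refined solvers, so this is only an illustrative stand-in:

```python
import numpy as np

def symnmf_pgd(A, k, steps=500, lr=1e-3, seed=0):
    """Minimal projected-gradient sketch of symmetric NMF, A ~ H H^T with H >= 0."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))
    for _ in range(steps):
        grad = 4 * (H @ H.T - A) @ H        # gradient of ||A - H H^T||_F^2 in H
        H = np.maximum(H - lr * grad, 0.0)  # project onto the nonnegative orthant
    return H

# Two disjoint cliques: a clean 2-community graph.
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
H = symnmf_pgd(A, k=2)
err = np.linalg.norm(A - H @ H.T)
assert (H >= 0).all()
assert err < np.linalg.norm(A)  # fit beats the trivial all-zeros reconstruction
```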
There has been significant recent interest in the limitations of graph neural network (GNN) models (Duvenaud et al., 2015; Kipf & Welling, 2017; Hamilton et al., 2017) at addressing network heterophily (NT & Maehara, 2019; Zhu et al., 2020; Zheng et al., 2022), as well as in proposed solutions (Pei et al., 2020; Yan et al., 2021), but relatively little such work for more fundamental models such as those for clustering. Some existing NMF approaches to clustering do naturally model heterophilous structure in networks. For example, the Generalized RDPG (Rubin-Delanchy et al., 2017) allows for the representation of non-PSD adjacency matrices by using a "tri-factorization" like that of Yu et al. (2005), but removing the constraint that the central diagonal matrix must be nonnegative. The model of Nourbakhsh et al. (2014) goes further, removing all constraints on this matrix. Another example is the model in Miller et al. (2009), which is similar to ours, though it restricts the cluster assignment matrix V to be binary; additionally, their training algorithm is not based on gradient descent as ours is, and it does not scale to larger networks. More recently, Peysakhovich & Bottou (2021) propose a decomposition of the form A ≈ D + BB⊤ - CC⊤, where D ∈ R^{n×n} is diagonal and B, C ∈ R^{n×k} are low-rank. Note that their decomposition does not include a nonlinear linking function, and their work does not pursue a clustering interpretation or investigate setting the factors B and C to be nonnegative.

Overlapping communities and exact embeddings. Many models discussed above focus on the single-label clustering task and thus involve highly-constrained factorizations (e.g., sum-to-one conditions). We are interested in the closely related but distinct task of multi-label clustering, also known as overlapping community detection (Xie et al., 2013; Javed et al., 2018), which involves less constrained, more expressive factorizations.
The BIGCLAM algorithm of Yang & Leskovec (2013) uses the following generative model for this task: the probability of a link between two nodes i and j is given by 1 - exp(-f_i · f_j), where f_i, f_j ∈ R_+^k represent the intensities with which the nodes participate in each of the k communities. Note that BIGCLAM assumes strict homophily of the communities: two nodes participating in the same community always increases the probability of a link. However, this model allows for expression of very dense intersections of communities, which the authors observe is generally a characteristic of real-world networks. To ensure that output entries are probabilities, BIGCLAM's factorization includes a nonlinear linking function (namely, f(x) = 1 - e^{-x}), like our model and LPCA. Recent work outside clustering and community detection on graph generative models (Rendsburg et al., 2020; Chanpuriya et al., 2020) suggests that incorporating a nonlinear linking function can greatly increase the expressiveness of factorization-based graph models, to the point of being able to exactly represent a graph. This adds to a growing body of literature on expressiveness guarantees for embeddings on relational data (Sala et al., 2018; Bhattacharjee & Dasgupta, 2020; Boratko et al., 2021). Most relevant to our work, as previously discussed, Chanpuriya et al. (2020) provide a guarantee for exact low-rank representation of graphs with bounded max degree when using the LPCA factorization model. In this work, we provide a new such guarantee, except in terms of bounded arboricity, which is more applicable to real-world networks, and extend these guarantees to our community-based factorization.
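BIGCLAM's link function and its strict homophily are easy to verify numerically (a sketch; F holds the nonnegative community intensities, and the names are ours):

```python
import numpy as np

def bigclam_prob(F):
    """BigClam edge probabilities: P(i~j) = 1 - exp(-f_i . f_j),
    for a nonnegative intensity matrix F of shape (n, k)."""
    return 1.0 - np.exp(-(F @ F.T))

rng = np.random.default_rng(2)
F = rng.random((4, 3))
P = bigclam_prob(F)
assert ((P >= 0) & (P < 1)).all()   # valid probabilities

# Strict homophily: raising a shared membership can only increase the probability.
F2 = F.copy()
F2[0, 0] += 1.0
F2[1, 0] += 1.0
assert bigclam_prob(F2)[0, 1] >= P[0, 1]
```

Because the link probability is nondecreasing in every shared membership, no setting of F can express the "coparticipation decreases linking" behavior that our C factor provides.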

4. THEORETICAL RESULTS

We first restate the main result from Chanpuriya et al. (2020) on exact representation of graphs with bounded max degree using the logistic principal components analysis (LPCA) model, which reconstructs a graph A ∈ {0, 1}^{n×n} using logit factors X, Y ∈ R^{n×k} via

A ≈ σ(XY⊤).    (4)

Note that unlike our community-based factorization, the factors of the LPCA model are not nonnegative, and the factorization does not reflect the symmetry of the undirected graph's adjacency matrix. Regardless of the model's interpretability, the following theorem provides a significant guarantee on its expressiveness. We use the following notation: given a matrix M, let H(M) denote the matrix resulting from entrywise application of the Heaviside step function to M, that is, setting all positive entries to 1, negative entries to 0, and zero entries to 1/2.

Theorem 4.1 (Exact LPCA Factorization for Bounded-Degree Graphs, Chanpuriya et al. (2020)). Let A ∈ {0, 1}^{n×n} be the adjacency matrix of a graph G with maximum degree c. Then there exist matrices X, Y ∈ R^{n×(2c+1)} such that A = H(XY⊤).

This corresponds to arbitrarily small approximation error in the LPCA model (Equation 4) because, provided such factors X, Y for some graph A, we have that lim_{s→∞} σ(sXY⊤) = H(XY⊤) = A. That is, we can scale the factors larger to reduce the error to an arbitrary extent. We expand on this result in two ways. First, we give a new bound for exact embedding in terms of arboricity, rather than max degree. This increases the applicability to real-world networks, which often are sparse (i.e., low arboricity) and have right-skewed degree distributions (i.e., high max degree). Second, we show that any rank-k LPCA factorization can be converted to our model's symmetric nonnegative factorization with O(k) communities.
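The scaling argument can be checked numerically: as s grows, σ(sXY⊤) approaches H(XY⊤) entrywise (a NumPy sketch with random factors):

```python
import numpy as np

def heaviside(M):
    """Entrywise step: positive -> 1, negative -> 0, zero -> 1/2."""
    return np.where(M > 0, 1.0, np.where(M < 0, 0.0, 0.5))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((5, 3))
L = X @ Y.T   # generic logit matrix (no exact zero entries, almost surely)

# Max entrywise gap between sigma(s * L) and H(L) shrinks as s grows.
gap = lambda s: np.abs(sigmoid(s * L) - heaviside(L)).max()
assert gap(50.0) < gap(10.0) < gap(1.0)
```

Since the per-entry gap is strictly decreasing in s wherever the logit is nonzero, the maximum gap is strictly decreasing as well, matching the limit stated above.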
This extends the guarantees on the LPCA model's power for exact representation of graphs, both the prior guarantee in terms of max degree and our new one in terms of arboricity, to our community-based model as well. In Appendix A.1, we also introduce a natural family of graphs, Community Overlap Threshold (COT) graphs, for which our model's community-based factorization not only exactly represents the graph, but also must capture some latent structure to do so with sufficiently low embedding dimensionality.

Arboricity bound for exact representation. We will use the following well-known fact: the rank of the entrywise product of two matrices is at most the product of their individual ranks, that is, rank(X ∘ Y) ≤ rank(X) · rank(Y).

Theorem 4.2 (Exact LPCA Factorization for Bounded-Arboricity Graphs). Let A ∈ {0, 1}^{n×n} be the adjacency matrix of an undirected graph G with arboricity α. Then there exist embeddings X, Y ∈ R^{n×(4α²+1)} such that A = H(XY⊤).

Proof. Let the undirected graph A have arboricity α, i.e., the edges can be partitioned into α forests. We produce a directed graph B from A by orienting the edges in these forests so that each node's edges point towards its children. Now A = B + B⊤, and every node in B has in-degree at most α. Let V ∈ R^{n×2α} be the Vandermonde matrix with V_{t,j} = t^{j-1}. For any c ∈ R^{2α}, [Vc](t) = Σ_{j=1}^{2α} c(j) · t^{j-1}; that is, Vc ∈ R^n is a polynomial with coefficient vector c evaluated at the integers t ∈ [n] = {1, . . . , n}. Let b_i be the i-th column of B. We seek to construct a polynomial such that for t with b_i(t) = 1, [Vc_i](t) = 0, and [Vc_i](t) < 0 elsewhere; that is, when inputting an index t ∈ [n] such that the t-th node is an in-neighbor of the i-th node, we want the polynomial to output 0, and for all other indices in [n], we want it to have a negative output. Letting N(i) denote the in-neighbors of the i-th node, a simple instantiation of such a polynomial in t is -1 · ∏_{j∈N(i)} (t - j)².
Note that since all nodes have in-degree at most α, this polynomial's degree is at most 2α, and hence there exists a coefficient vector c_i ∈ R^{2α} encoding this polynomial. Let C ∈ R^{n×2α} be the matrix resulting from stacking such coefficient vectors for each of the n nodes. Consider P = VC⊤ ∈ R^{n×n}: P_{i,j} is 0 if B_{i,j} = 1 and negative otherwise. Then (P ∘ P⊤)_{i,j} is 0 when either B_{i,j} = 1 or (B⊤)_{i,j} = 1, and positive otherwise; equivalently, since A = B + B⊤, (P ∘ P⊤)_{i,j} = 0 iff A_{i,j} = 1. Take any positive ϵ less than the smallest positive entry of P ∘ P⊤. Letting J be the all-ones matrix, define M = ϵJ - (P ∘ P⊤). Note that M_{i,j} > 0 if A_{i,j} = 1 and M_{i,j} < 0 if A_{i,j} = 0; that is, H(M) = A as desired. Since rank(J) = 1 and rank(P) ≤ 2α, by the bound on the rank of entrywise products of matrices, the rank of M is at most (2α)² + 1. ■

Exact representation with community factorization. LPCA factors X, Y ∈ R^{n×k} can be processed into nonnegative factors B ∈ R_+^{n×k_B} and C ∈ R_+^{n×k_C} such that k_B + k_C = 6k and

BB⊤ - CC⊤ = (1/2)(XY⊤ + YX⊤).    (5)

Observe that the left-hand side can only represent symmetric matrices, but XY⊤ is not necessarily symmetric even if H(XY⊤) = A for a symmetric A. For this reason, we use a symmetrization: let L = (1/2)(XY⊤ + YX⊤). Note that H(L) = H(XY⊤), so if XY⊤ constitutes an exact representation of A in that H(XY⊤) = A, so too do both expressions for L in Equation 5. Pseudocode for the procedure of constructing B, C given X, Y is given in Algorithm 1. The concept of this algorithm is to first separate the logit matrix L into a sum and difference of rank-1 components via eigendecomposition. Each of these components can be written as +vv⊤ or -vv⊤ with v ∈ R^n, where the sign depends on the sign of the eigenvalue. Each component is then separated into a sum and difference of three outer products of nonnegative vectors, via Lemma 4.3 below.

Lemma 4.3. Let ϕ : R → R denote the ReLU function, i.e., ϕ(z) = max{z, 0}.
For any vector v,

vv⊤ = 2ϕ(v)ϕ(v)⊤ + 2ϕ(-v)ϕ(-v)⊤ - |v||v|⊤.

Proof. Take any v ∈ R^k. Then

vv⊤ = (ϕ(v) - ϕ(-v))(ϕ(v) - ϕ(-v))⊤
    = ϕ(v)ϕ(v)⊤ + ϕ(-v)ϕ(-v)⊤ - ϕ(v)ϕ(-v)⊤ - ϕ(-v)ϕ(v)⊤
    = 2ϕ(v)ϕ(v)⊤ + 2ϕ(-v)ϕ(-v)⊤ - (ϕ(v) + ϕ(-v))(ϕ(v) + ϕ(-v))⊤
    = 2ϕ(v)ϕ(v)⊤ + 2ϕ(-v)ϕ(-v)⊤ - |v||v|⊤,

where the first step follows from v = ϕ(v) - ϕ(-v), and the last step from |v| = ϕ(v) + ϕ(-v). ■

Algorithm 1 follows from Lemma 4.3 and constitutes a constructive proof of the following theorem:

Theorem 4.4 (Exact Community Factorization from Exact LPCA Factorization). Given a symmetric matrix A ∈ {0, 1}^{n×n} and X, Y ∈ R^{n×k} such that A = H(XY⊤), there exist nonnegative matrices B ∈ R_+^{n×k_B} and C ∈ R_+^{n×k_C} such that k_B + k_C = 6k and A = H(BB⊤ - CC⊤).

Algorithm 1: Converting LPCA Factorization to Community Factorization
input: logit factors X, Y ∈ R^{n×k}
output: B ∈ R_+^{n×k_B} and C ∈ R_+^{n×k_C} such that k_B + k_C = 6k and BB⊤ - CC⊤ = (1/2)(XY⊤ + YX⊤)
1: Set Q ∈ R^{n×2k} and λ ∈ R^{2k} by truncated eigendecomposition such that Q diag(λ) Q⊤ = (1/2)(XY⊤ + YX⊤)
2: B* ← Q_+ diag(√(+λ_+)), where λ_+, Q_+ are the positive eigenvalues/eigenvectors
3: C* ← Q_- diag(√(-λ_-)), where λ_-, Q_- are the negative eigenvalues/eigenvectors
4: B ← [√2 ϕ(B*); √2 ϕ(-B*); |C*|]  ▷ ϕ and |·| are entrywise ReLU and absolute value; [·;·] concatenates columns
5: C ← [√2 ϕ(C*); √2 ϕ(-C*); |B*|]
6: return B, C

As stated in the introduction to this section, Theorem 4.4 extends any upper bound on the exact factorization dimensionality from the LPCA model to our community-based model. That is, up to a constant factor, the bound in terms of max degree from Theorem 4.1 and the bound in terms of arboricity from Theorem 4.2 also apply to our model; for brevity, we state just the latter here.

Corollary 4.5 (Exact Community Factorization for Bounded-Arboricity Graphs). Let A ∈ {0, 1}^{n×n} be the adjacency matrix of an undirected graph G with arboricity α.
Then there exist nonnegative embeddings B ∈ R_+^{n×k_B} and C ∈ R_+^{n×k_C} such that k_B + k_C = 6(4α² + 1) and A = H(BB⊤ - CC⊤).

Note that Corollary 4.5 is purely a statement about the capacity of our model; Theorem 4.2 stems from a constructive proof based on polynomial interpolation, and therefore so too does this corollary. We do not expect this factorization to be informative about the graph's latent structure. In the following Section 5, we fit the model with an entirely different algorithm for downstream applications.
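The polynomial construction behind Theorem 4.2 can be run end-to-end on a small example. The NumPy sketch below uses a path graph (a single tree, so arboricity α = 1) and allocates 2α + 1 Vandermonde columns so that the degree-2α polynomials fit:

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

# Path graph on 5 nodes: a single tree, so arboricity alpha = 1.
n, alpha = 5, 1
A = np.zeros((n, n))
for u in range(n - 1):
    A[u, u + 1] = A[u + 1, u] = 1

# Orient edges parent -> child: column i of Bdir lists the in-neighbors of node i.
Bdir = np.zeros((n, n))
for u in range(n - 1):
    Bdir[u, u + 1] = 1
assert np.array_equal(A, Bdir + Bdir.T)

# For node i, p_i(t) = -prod_{j in N(i)} (t - j)^2 over 1-indexed t:
# zero exactly at i's in-neighbors, negative at every other t in [n].
d = 2 * alpha + 1   # coefficients needed for a degree-2*alpha polynomial
Coef = np.zeros((n, d))
for i in range(n):
    roots = [t + 1 for t in range(n) if Bdir[t, i] == 1]
    c = -npoly.polyfromroots(roots + roots)   # doubled roots give squared factors
    Coef[i, :len(c)] = c

V = np.vander(np.arange(1, n + 1), N=d, increasing=True)  # Vandermonde matrix
P = V @ Coef.T     # P[t-1, i] = p_i(t); zero iff Bdir[t-1, i] == 1

PPt = P * P.T      # zero iff A_ij = 1, strictly positive otherwise
eps = PPt[PPt > 0].min() / 2
M = eps * np.ones((n, n)) - PPt   # low rank by the entrywise-product rank bound
assert np.array_equal((M > 0).astype(float), A)   # H(M) recovers A exactly
```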
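Algorithm 1 is short enough to implement and verify directly (a NumPy sketch of our reading of the pseudocode, with random logit factors):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def lpca_to_community(X, Y):
    """Convert LPCA logit factors (X, Y) into nonnegative factors (B, C)
    with B B^T - C C^T = (X Y^T + Y X^T) / 2, following Algorithm 1."""
    L = (X @ Y.T + Y @ X.T) / 2                 # symmetrized logit matrix
    lam, Q = np.linalg.eigh(L)
    pos, neg = lam > 0, lam < 0
    Bs = Q[:, pos] * np.sqrt(lam[pos])          # B* = Q+ diag(sqrt(+lambda+))
    Cs = Q[:, neg] * np.sqrt(-lam[neg])         # C* = Q- diag(sqrt(-lambda-))
    B = np.hstack([np.sqrt(2) * relu(Bs), np.sqrt(2) * relu(-Bs), np.abs(Cs)])
    C = np.hstack([np.sqrt(2) * relu(Cs), np.sqrt(2) * relu(-Cs), np.abs(Bs)])
    return B, C

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 2))
Y = rng.standard_normal((6, 2))
B, C = lpca_to_community(X, Y)
assert (B >= 0).all() and (C >= 0).all()
assert np.allclose(B @ B.T - C @ C.T, (X @ Y.T + Y @ X.T) / 2)
```

The final equality is exactly the identity guaranteed by Lemma 4.3 applied column by column to the eigendecomposition of L.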

5. EXPERIMENTS

We now present a training algorithm to fit our model, then evaluate our method on a benchmark of five real-world networks. These are fairly common small to mid-size datasets ranging from around 1K to 10K nodes; for brevity, we defer the statistics and discussion of these datasets, including how some of them exhibit heterophily, to Appendix A.2.

5.1. TRAINING ALGORITHM

Given an input graph A ∈ {0, 1}^{n×n}, we find low-rank nonnegative matrices B and C such that the model produces Ã = σ(BB⊤ - CC⊤) ∈ (0, 1)^{n×n} as in Equation 1 which approximately matches A. In particular, we train the model to minimize the sum of binary cross-entropies of the link predictions over all pairs of nodes:

R = -Σ [A ∘ log(Ã) + (1 - A) ∘ log(1 - Ã)],    (6)

where Σ denotes the scalar summation of all entries in the matrix and ∘ the entrywise product. We fit the parameters by gradient descent on this loss, together with L2 regularization of the factors B and C, subject to the nonnegativity of B and C. This algorithm is fairly straightforward; pseudocode is given in Algorithm 2. It is quite similar to the training algorithm of Chanpuriya et al. (2020), but in contrast to that work, which only targets an exact fit, we explore the expression of graph structure in the factors and their utility in downstream tasks. Regularization of the factors is implemented to this end to avoid overfitting. Though we outline a non-stochastic version of the training algorithm, it generalizes straightforwardly to a stochastic version, i.e., by sampling links and non-links for the loss function.

Algorithm 2 (excerpt): in each iteration, update B and C to minimize R using the gradients ∂_{B,C} R, subject to B, C ≥ 0; after the final iteration, return B, C.

Implementation details. Our implementation uses PyTorch (Paszke et al., 2019) for automatic differentiation and minimizes the loss using the SciPy (Jones et al., 2001) implementation of the L-BFGS (Liu & Nocedal, 1989; Zhu et al., 1997) algorithm with default hyperparameters and a maximum of 200 iterations of optimization. We set the regularization weight λ = 10 as in Yang & Leskovec (2013). We include code in the form of a Jupyter notebook (Pérez & Granger, 2007) demo. The code also contains stochastic (i.e., more scalable) versions of the optimization functions.
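For illustration, the non-stochastic training loop can be sketched in NumPy with projected gradient descent; the actual implementation uses PyTorch autodiff with L-BFGS, so treat this only as a minimal stand-in. It relies on the standard identity that the gradient of the cross-entropy with respect to the logits is Ã - A:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def fit(A, kB, kC, steps=300, lr=1e-2, reg=0.0, seed=0):
    """Projected gradient descent on the cross-entropy loss R,
    keeping B and C nonnegative by clamping after each step."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    B, C = 0.1 * rng.random((n, kB)), 0.1 * rng.random((n, kC))
    for _ in range(steps):
        G = sigmoid(B @ B.T - C @ C.T) - A   # dR/d(logits), logits = BB^T - CC^T
        B = np.maximum(B - lr * (2 * G @ B + 2 * reg * B), 0.0)
        C = np.maximum(C - lr * (-2 * G @ C + 2 * reg * C), 0.0)
    return B, C

def xent(A, At, eps=1e-12):
    return -(A * np.log(At + eps) + (1 - A) * np.log(1 - At + eps)).sum()

# Smoke test: a single dense community (complete graph without self-loops).
A = np.ones((8, 8)) - np.eye(8)
B, C = fit(A, kB=2, kC=1)
At = sigmoid(B @ B.T - C @ C.T)
assert (B >= 0).all() and (C >= 0).all()
assert xent(A, At) < xent(A, np.full_like(A, 0.5))  # beats an uninformative baseline
```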

5.2. RESULTS

Expressiveness. First, we investigate the expressiveness of our generative model, that is, the fidelity with which it can reproduce an input network. In Section 1, we used a simple synthetic network to show that our model is more expressive than others due to its ability to represent heterophilous structures in addition to homophilous ones. We now evaluate the expressiveness of our model on real-world networks. As with the synthetic graph, we fix the number of communities or singular vectors, fit the model, then evaluate the reconstruction error. In Figure 4, we compare the results of our model with those of SVD, BIGCLAM (which is discussed in detail in Section 3), and SYMNMF (Kuang et al., 2015). SYMNMF simply factors the adjacency matrix as A ≈ HH⊤, where H ∈ R_+^{n×k}; note that, like SVD, SYMNMF does not necessarily output a matrix whose entries are probabilities (i.e., bounded in [0, 1]), and hence it is not a graph generative model like ours and BIGCLAM. For each method, we fix the number of communities or singular vectors at the ground-truth number. For this experiment only, we are not concerned with learning the latent structure of the graph; the only goal is accurate representation of the network with limited parameters. So, for a fair comparison with SVD, we do not regularize the training of the other methods. Our method consistently has the lowest reconstruction error, both in terms of Frobenius error and entrywise cross-entropy (Equation 6). Interestingly, we find the most significant improvement exactly on the three datasets which have been noted to exhibit significant heterophily: POS, PPI, and AMAZON.

Similarity to ground-truth communities. To assess the interpretability of clusters generated by our method, we evaluate the similarity of these clusters to ground-truth communities (i.e., class labels), and we compare to other methods for overlapping clustering.
We additionally compare to another recent but non-generative approach, the VGRAPH method of Sun et al. (2019), which is based on link clustering; its authors found it to generally achieve state-of-the-art results on this task. For all methods, we set the number of communities to be detected to the number of ground-truth communities. We report F1 scores as computed in Yang & Leskovec (2013). See Figure 5 (left): the performance of our method is competitive with SYMNMF, BIGCLAM, and VGRAPH. (We are unable to run the authors' implementation of VGRAPH on BLOG within our memory limits.)

Interpretable link prediction We assess the predictive power of our generative model on the link prediction task. As discussed in Section 2, the link probabilities output by our model are interpretable in terms of a node clustering that the model generates; we compare results of our method to those of other models which permit similar interpretation, namely BIGCLAM and SYMNMF. We randomly select 10% of node pairs to hold out, fit the models on the remaining 90%, then use the trained models to predict links between node pairs in the held-out 10%. As a baseline, we also show results for randomly predicting link or no link with equal probability. See Figure 5 (right): the performance of our method is competitive with or exceeds that of the other methods in terms of F1 score.
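The hold-out protocol and F1 evaluation above can be sketched as follows; the helper names are hypothetical, and predictions are assumed to come from thresholding a model's link probabilities at 0.5.

```python
import numpy as np

def holdout_split(A, frac=0.1, seed=0):
    """Randomly hold out a fraction of node pairs (i < j) for evaluation."""
    n = A.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rng = np.random.default_rng(seed)
    rng.shuffle(pairs)
    cut = int(frac * len(pairs))
    return pairs[:cut], pairs[cut:]  # held-out pairs, training pairs

def f1_score(y_true, y_pred):
    """F1 score for binary link/no-link predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```

The random baseline corresponds to `y_pred` drawn uniformly from {0, 1} for each held-out pair.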

6. CONCLUSION

We introduce a community-based graph generative model based on symmetric nonnegative matrix factorization which is capable of representing both homophily and heterophily. We expand on prior work guaranteeing exact representation of bounded max degree graphs with a new, more applicable guarantee for bounded arboricity graphs, and we show that both the prior bound and our new one apply to our more interpretable graph model. We illustrate our model's capabilities with experiments on a synthetic motivating example. Experiments on real-world networks show its effectiveness on several key tasks. More broadly, our results suggest that incorporating heterophily into models and methods for networks can improve both theoretical grounding and overall empirical performance, while maintaining simplicity and interpretability. A deeper understanding of the expressiveness of both nonnegative and arbitrary low-rank logit models for graphs is an interesting future direction.

A.2 DATASET DESCRIPTIONS

As stated in Section 5, we now briefly describe the five real-world datasets; statistics for these datasets are given in Table 1. BLOG is a social network of relationships between online bloggers; the node labels represent interests of the bloggers. Similarly, YOUTUBE is a social network of YouTube users, and the labels represent groups that the users joined. POS is a word co-occurrence network: nodes represent words, and there are edges between words which are frequently adjacent in a section of the Wikipedia corpus. Each node label represents the part of speech of the word. PPI is a subgraph of the protein-protein interaction network for Homo sapiens; labels represent biological states. Finally, AMAZON is a co-purchasing network: nodes represent products, and there are edges between products which are frequently purchased together; labels represent categories of products. While social networks like the former two are generally dominated by homophily (McPherson et al., 2001), the latter three should exhibit significant heterophily. In co-purchasing networks like AMAZON, depending on the product, two products of the same kind are generally not co-purchased, e.g., Pepsi and Coke, as discussed in Peysakhovich & Bottou (2021). Though less intuitively accessible, there is also prior discussion of disassortativity in word adjacencies (Foster et al., 2010; Zweig, 2016), as well as in PPI networks (Newman, 2002; Hase et al., 2010).



Figure 1: The motivating synthetic graph. The expected adjacency matrix (left) and the sampled matrix (right); the latter, which is passed to the training algorithms, is produced by treating the entries of the former as parameters of Bernoulli distributions and sampling. The network is approximately a union of ten bipartite graphs, each of which corresponds to the men and women in one of ten cities.
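The construction in this caption can be sketched as follows. The group sizes and link probabilities here are illustrative placeholders, not the exact values used to generate Figure 1.

```python
import numpy as np

def synthetic_graph(n_cities=10, per_group=5, p_in=0.9, p_out=0.02, seed=0):
    """Expected adjacency of a union of bipartite (men-women) graphs, one per
    city, plus a Bernoulli sample of it."""
    n = n_cities * 2 * per_group
    P = np.full((n, n), p_out)  # small background link probability
    for c in range(n_cities):
        base = c * 2 * per_group
        men = slice(base, base + per_group)
        women = slice(base + per_group, base + 2 * per_group)
        P[men, women] = p_in  # heterophilous links within a city
        P[women, men] = p_in
    np.fill_diagonal(P, 0.0)  # no self-loops
    rng = np.random.default_rng(seed)
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)
    A = A + A.T  # symmetrize the sample (undirected graph)
    return P, A
```

`P` plays the role of the expected adjacency matrix (left panel) and `A` the sampled matrix (right panel).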

Figure 2: Left: Reconstructions of the motivating synthetic graph of Figure 1 with SVD, BIGCLAM, and our model, using 12 communities or singular vectors. Note the lack of the small diagonal structure in BIGCLAM's reconstruction; this corresponds to its inability to capture the heterophilous interaction between men and women. Right: Frobenius error when reconstructing the motivating synthetic graph of Figure 1 with SVD, BIGCLAM, and our model, as the embedding length is varied. The error is normalized by the sum of the true adjacency matrix (i.e., twice the number of edges).

Figure 3: Factors resulting from decomposition of the motivating synthetic graph of Figure 1 with the three models, using 12 communities or singular vectors. The top/bottom rows represent the positive/negative eigenvalues corresponding to homophilous/heterophilous communities (note that BIGCLAM does not include the latter). The homophilous factors from BIGCLAM and our model reflect the 10 cities, and the heterophilous factor from our model reflects men and women. The factors from SVD are harder to interpret. Note that the order of the communities in the factors is arbitrary.

Algorithm 2 Fitting the Constrained Model
Input: adjacency matrix A ∈ {0, 1} n×n , regularization weight λ ≥ 0, number of iterations I, number of homophilous/heterophilous communities k_B / k_C
Output: fitted factors B ∈ R n×k_B + and C ∈ R n×k_C + such that σ(BB⊤ − CC⊤) ≈ A
1: Initialize B, C by setting entries to independent samples of Unif(0, 1/√k_B), Unif(0, 1/√k_C)
2: for i ← 1 to I do
3:   Ã ← σ(BB⊤ − CC⊤)
4:   R ← −∑ [ A ⊙ log(Ã) + (1 − A) ⊙ log(1 − Ã) ]
5:   R ← R + λ(‖B‖²_F + ‖C‖²_F)
6:   Compute gradients ∂_{B,C} R
7:   Update B, C to minimize R using ∂_{B,C} R, subject to B, C ≥ 0
8: end for
9: return B, C

Figure 4: Reconstruction error on real-world networks, relative to our model's error.

Figure 5: Left: Similarity of recovered communities to ground-truth labels of real-world datasets. We are unable to run the authors' implementation of VGRAPH on BLOG within our memory limits. Right: Accuracy of link prediction on real-world datasets.

Table 1: Statistics of datasets used in our experiments. As in Sun et al. (2019), for YOUTUBE and AMAZON, we take only nodes which participate in at least one of the 5 largest ground-truth communities. Note that degeneracy is an upper bound on arboricity.
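The degeneracy statistic mentioned in the caption can be computed by the standard peeling procedure: repeatedly remove a minimum-degree vertex, and record the largest degree seen at removal time. A simple sketch:

```python
def degeneracy(adj):
    """Degeneracy of an undirected graph given as {node: set(neighbors)}.

    Repeatedly removes a minimum-degree vertex; the maximum degree observed
    at removal time is the degeneracy, an upper bound on arboricity.
    """
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    deg = 0
    while adj:
        u = min(adj, key=lambda v: len(adj[v]))  # a minimum-degree vertex
        deg = max(deg, len(adj[u]))
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
    return deg
```

With a bucket queue over degrees this runs in linear time; the version above favors brevity over asymptotics.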

A APPENDIX

A.1 COT GRAPH EXACT REPRESENTATION

As a theoretical demonstration of the capability of our model to learn latent structure, we additionally show that our model can exactly and interpretably represent, with small k, a natural family of graphs which exhibits both homophily and heterophily. The family of graphs is specified below in Definition 1; roughly speaking, nodes in such graphs share an edge iff the number of homophilous communities in which they coparticipate exceeds the number of heterophilous communities in which they coparticipate by at least some threshold. For example, the motivating graph described in Section 1 would be an instance of such a graph if an edge occurs between two users iff the two users are from the same city and have different genders.

Definition 1 (Community Overlap Threshold (COT) Graph). An unweighted, undirected graph whose edges are determined by an overlapping clustering and a "thresholding" integer t ∈ Z as follows: for each vertex i, there are two latent binary vectors b_i ∈ {0, 1} k_b and c_i ∈ {0, 1} k_c , and there is an edge between vertices i and j iff b_i · b_j − c_i · c_j ≥ t.

Theorem A.1 (Compact Representation of COT Graphs). Suppose A is the adjacency matrix of a COT graph on n nodes with latent vectors b_i ∈ {0, 1} k_b and c_i ∈ {0, 1} k_c . Then there exist nonnegative matrices B ∈ R n×k_b + and C ∈ R n×k_c + such that our model with factors B and C exactly represents A.

Proof. Let t be the thresholding integer of the graph, and let the rows of B ∈ {0, 1} n×k_b and C ∈ {0, 1} n×k_c contain the vectors b and c of all nodes. Via Equation 2, we can find
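As an illustration of Definition 1, the following sketch (not from the paper) constructs a COT graph's adjacency matrix from latent binary vectors:

```python
import numpy as np

def cot_adjacency(B, C, t):
    """Adjacency matrix of a COT graph: edge between i and j iff
    b_i . b_j - c_i . c_j >= t, where the rows of the 0/1 matrices
    B and C are the latent vectors b_i and c_i."""
    S = B @ B.T - C @ C.T     # pairwise community-overlap scores
    A = (S >= t).astype(int)  # threshold at t
    np.fill_diagonal(A, 0)    # no self-loops
    return A
```

For instance, with each row of B encoding city membership and each row of C encoding gender, t = 1 recovers the motivating graph: an edge requires a shared city and differing genders.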

