UNVEILING THE SAMPLING DENSITY IN NON-UNIFORM GEOMETRIC GRAPHS

Abstract

A powerful framework for studying graphs is to consider them as geometric graphs: nodes are randomly sampled from an underlying metric space, and any pair of nodes is connected if their distance is less than a specified neighborhood radius. Currently, the literature mostly focuses on uniform sampling and constant neighborhood radius. However, real-world graphs are likely to be better represented by a model in which the sampling density and the neighborhood radius can both vary over the latent space. For instance, in a social network communities can be modeled as densely sampled areas, and hubs as nodes with larger neighborhood radius. In this work, we first perform a rigorous mathematical analysis of this (more general) class of models, including derivations of the resulting graph shift operators. The key insight is that graph shift operators should be corrected in order to avoid potential distortions introduced by the non-uniform sampling. Then, we develop methods to estimate the unknown sampling density in a self-supervised fashion. Finally, we present exemplary applications in which the learned density is used to 1) correct the graph shift operator and improve performance on a variety of tasks, 2) improve pooling, and 3) extract knowledge from networks. Our experimental findings support our theory and provide strong evidence for our model.

1. INTRODUCTION

Graphs are mathematical objects used to represent relationships among entities. Their use is ubiquitous, ranging from social networks to recommender systems, from protein-protein interactions to functional brain networks. Despite their versatility, their non-Euclidean nature makes graphs hard to analyze. For instance, the indexing of the nodes is arbitrary, there is no natural definition of orientation, and neighborhoods can vary in size and topology. Moreover, it is not clear how to compare a general pair of graphs, since they can have different numbers of nodes. Therefore, the community has developed new ways of thinking about graphs. One approach is proposed in graphon theory (Lovász, 2012): graphs are sampled from continuous graph models called graphons, and any two graphs of any size and topology can be compared using certain metrics defined on the space of graphons.

A geometric graph is an important case of a graph sampled from a graphon. In a geometric graph, a set of points is uniformly sampled from a metric-measure space, and every pair of points is linked if their distance is less than a specified neighborhood radius. A geometric graph therefore inherits a geometric structure from its latent space that can be leveraged to perform rigorous mathematical analysis and to derive computational methods. Geometric graphs have a long history, dating back to the 1960s (Gilbert, 1961), and have been extensively used to model complex spatial networks (Barthelemy, 2011). One of the first models of geometric graphs is the random geometric graph (Penrose, 2003), where the latent space is the Euclidean unit square. Various generalizations and modifications of this model have been proposed in the literature, such as random rectangular graphs (Estrada & Sheerin, 2015), random spherical graphs (Allen-Perkins, 2018), and random hyperbolic graphs (Krioukov et al., 2010). Geometric graphs are particularly useful since they share properties with real-world networks.
For instance, random hyperbolic graphs are small-world and scale-free, with high clustering (Papadopoulos et al., 2010; Gugelmann et al., 2012). The small-world property asserts that the distance between any two nodes is small, even if the graph is large. The scale-free property describes the degree sequence as a heavy-tailed distribution: a small number of nodes have many connections, while the rest have small neighborhoods. These two properties are related to the presence of hubs, nodes with large neighborhoods, while the high clustering is related to the network's community structure.

However, standard geometric graph models focus mainly on uniform sampling, which does not describe real-world networks well. For instance, in location-based social networks, the spatial distribution of nodes is rarely uniform because people congregate around city centers (Cho et al., 2011; Wang & González, 2009). In online communities such as the LiveJournal social network, non-uniformity arises since the probability of befriending a particular person is inversely proportional to the number of closer people (Hu et al., 2011; Liben-Nowell et al., 2005). In a WWW network, there are more pages for popular topics than for obscure ones. In social networks, different demographics (age, gender, ethnicity, etc.) may join a social media platform at different rates. For surface meshes, specific locations may be sampled more finely, depending on the required level of detail. The imbalance caused by non-uniform sampling can affect the analysis and lead to biased results. For instance, Janssen et al. (2016) show that incorrectly assuming a uniform density consistently overestimates the node distances, while using the (estimated) density gives more accurate results. Therefore, it is essential to assess the sampling density, which is one of the main goals of this paper. Barring a few exceptions, non-uniformity is rarely considered in geometric graphs.
Iyer & Thacker (2012) study a class of non-uniform random geometric graphs where the radii depend on the location. Martínez-Martínez et al. (2022) study non-uniform graphs on the plane with the density functions specified in polar coordinates. Pratt et al. (2018) consider temporal connectivity in finite networks with non-uniform measures. In all of these works, the focus is on (asymptotic) statistical properties of the graphs, such as the average degree and the number of isolated nodes.

1.1. OUR CONTRIBUTION

While traditional Laplacian approximation approaches solve the direct problem (approximating a known continuous Laplacian with a graph Laplacian), in this paper we solve the inverse problem: constructing a graph Laplacian from an observed graph that is guaranteed to approximate an unknown continuous Laplacian. We believe that our approach has high practical significance: in practical data science on graphs, the graph is typically given, but the underlying continuous model is unknown. To solve this inverse problem, we introduce the non-uniform geometric graph (NuG) model. Unlike the standard geometric graph model, a NuG is generated by a non-uniform sampling density and a non-constant neighborhood radius. In this setting, we propose a class of graph shift operators (GSOs), called non-uniform geometric GSOs, that are computed solely from the topology of the graph and the node/edge features, while guaranteeing that these GSOs approximate corresponding latent continuous operators defined on the underlying geometric spaces. Together with Dasoulas et al. (2021) and Sahbi (2021), our work can be listed as a theoretically grounded way to learn the GSO. Justified by formulas grounded in Monte-Carlo analysis, we show how to compensate for the non-uniformity in the sampling when computing non-uniform geometric GSOs. This requires having estimates of both the sampling density and the neighborhood radii. Estimating these by only observing the graph is a hard task. For example, graph quantities like the node degrees are affected both by the density and by the radius, and hence it is hard to decouple the density from the radius by only observing the graph. We hence propose methods for estimating the density (and radius) using a self-supervision approach. The idea is to train, against some arbitrary task, a spectral graph neural network where the GSOs underlying the convolution operators are taken as non-uniform geometric GSOs with learnable density.
For the model to perform well, it learns to estimate the underlying sampling density, even though it is not directly supervised to do so. We explain heuristically the feasibility of the self-supervision approach on a sub-class of non-uniform geometric graphs that we call geometric graphs with hubs. This is a class of geometric graphs, motivated by properties of real-world networks, where the radius is roughly piecewise constant and the sampling density is smooth. We show experimentally that the NuG model can effectively model real-world graphs by training a graph auto-encoder, where the encoder embeds the nodes in an underlying geometric space, and the decoder produces edges according to the NuG model. Moreover, we show that using our non-uniform geometric GSOs with learned sampling density in spectral graph neural networks improves downstream tasks. Finally, we present proof-of-concept applications in which we use the learned density to improve pooling and to extract knowledge from graphs.

2. NON-UNIFORM GEOMETRIC MODELS

In this section, we define non-uniform geometric GSOs, and a subclass of such GSOs called geometric graphs with hubs. To compute such GSOs from the data, we show how to estimate the sampling density from a given graph using self-supervision.

2.1. GRAPH SHIFT OPERATORS AND KERNEL OPERATORS

We denote graphs by G = (V, E), where V is the set of nodes, |V| is the number of nodes, and E is the set of edges. A one-dimensional graph signal is a mapping u : V → ℝ. For a higher feature dimension F ∈ ℕ, a signal is a mapping u : V → ℝ^F. In graph data science, the data typically comprises only the graph structure G and node/edge features u, and the practitioner has the freedom to design a graph shift operator (GSO). Loosely speaking, given a graph G = (V, E), a GSO is any matrix L ∈ ℝ^{|V|×|V|} that respects the connectivity of the graph, i.e., L_{i,j} = 0 whenever (i, j) ∉ E and i ≠ j (Mateos et al., 2019). GSOs are used in graph signal processing to define filters as functions of the GSO of the form f(L), where f : ℝ → ℝ is, e.g., a polynomial (Defferrard et al., 2016) or a rational (Levie et al., 2019) function. The filters operate on graph signals u by f(L)u. Spectral graph convolutional networks are the class of graph neural networks that implement convolutions as filters. When a spectral graph convolutional network is trained, only the filters f : ℝ → ℝ are learned. One significant advantage of the spectral approach is that the convolution network is not tied to a specific graph, but can rather be transferred between different graphs of different sizes and topologies.

In this work, we see GSOs as randomly sampled from kernel operators defined on underlying geometric spaces. The underlying spaces are modelled as metric spaces. To allow modeling the random sampling of points, each metric space is also assumed to be a probability space.

Definition 1. Let (S, d, µ) be a metric-probability space with probability measure µ and metric d; let m ∈ L^∞(S); let K ∈ L^∞(S × S). The metric-probability Laplacian $\mathcal{L} = \mathcal{L}_{K,m}$ is defined as

$$\mathcal{L} : L^\infty(S) \to L^\infty(S), \qquad (\mathcal{L}u)(x) = \int_S K(x, y)\, u(y)\, d\mu(y) - m(x)\, u(x). \tag{1}$$

For example, let S be a Riemannian manifold, and take K(x, y) = 1_{B_α(x)}(y)/µ(B_α(x)) and m(x) = 1, where B_α(x) is the ball of radius α about x.
In this case, the operator $\mathcal{L}_{K,m}$ approximates the Laplace-Beltrami operator when α is small (Burago et al., 2019). A random graph is generated by randomly sampling points from the metric-probability space (S, d, µ). As a modeling assumption, we suppose the sampling is performed according to a measure ν. We assume ν is a weighted measure with respect to µ, i.e., there exists a density function ρ : S → (0, ∞) such that dν(y) = ρ(y) dµ(y). We assume that ρ is bounded away from zero and infinity. Using a change of variables, it is easy to see that

$$(\mathcal{L}u)(x) = \int_S K(x, y)\, \rho(y)^{-1}\, u(y)\, d\nu(y) - m(x)\, u(x).$$

Let x = {x_i}_{i=1}^N be a random independent sample from S according to the distribution ν. The corresponding sampled GSO L is defined by

$$L_{i,j} = N^{-1} K(x_i, x_j)\, \rho(x_j)^{-1} - \delta_{i,j}\, m(x_i), \tag{2}$$

where δ_{i,j} is the Kronecker delta. Given a signal u ∈ L^∞(S) and its sampled version u = {u(x_i)}_{i=1}^N, it is well known that (Lu)_i approximates ($\mathcal{L}$u)(x_i) for every i ∈ {1, . . . , N} (Hein et al., 2007; von Luxburg et al., 2008).
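As a numerical sanity check of (2), the following NumPy sketch (our illustration, not the authors' code) samples points from a non-uniform density on the unit circle, builds the sampled GSO with the ball-averaging kernel of the example after Def. 1 and m ≡ 1, and compares (Lu)_i to a closed form of the continuous operator. The choice of density, the rejection-sampling step, and all names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent space: unit circle parameterized by [0, 1) with wrap-around metric.
def dist(x, y):
    d = np.abs(x - y)
    return np.minimum(d, 1.0 - d)

rho = lambda y: 1.0 + 0.5 * np.sin(2 * np.pi * y)  # sampling density (integrates to 1)
alpha = 0.2                                        # ball radius

# Kernel of the example after Def. 1: K(x, y) = 1_{B_alpha(x)}(y) / mu(B_alpha(x)),
# with mu(B_alpha(x)) = 2 * alpha on the circle, and m(x) = 1.
def K(x, y):
    return (dist(x, y) <= alpha) / (2 * alpha)

# Draw N points from rho by rejection sampling (envelope constant 1.5 >= max rho).
N = 3000
x = np.empty(0)
while x.size < N:
    prop = rng.random(2 * N)
    x = np.concatenate([x, prop[1.5 * rng.random(2 * N) < rho(prop)]])
x = x[:N]

# Sampled GSO of Eq. (2): L_ij = N^{-1} K(x_i, x_j) rho(x_j)^{-1} - delta_ij m(x_i).
u = np.sin(2 * np.pi * x)
Lu = (K(x[:, None], x[None, :]) / rho(x)[None, :]) @ u / N - u  # m(x) = 1

# Closed form of the continuous Laplacian applied to u(y) = sin(2*pi*y):
# (Lu)(x) = sin(2*pi*x) * (sin(2*pi*alpha) / (2*pi*alpha) - 1).
exact = u * (np.sin(2 * np.pi * alpha) / (2 * np.pi * alpha) - 1)
err = np.mean(np.abs(Lu - exact))   # Monte-Carlo error, decays like N^{-1/2}
```

Note that the correction by ρ(x_j)⁻¹ is what makes the Monte-Carlo sum unbiased under the non-uniform distribution ν.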

2.2. NON-UNIFORM GEOMETRIC GSOS

According to (2), a GSO L can be directly sampled from the metric-probability Laplacian $\mathcal{L}$. However, such an approach would violate our motivating guidelines, since we are interested in GSOs that can be computed directly from the graph structure, without explicitly knowing the underlying continuous kernel and density. In this subsection, we define a class of metric-probability Laplacians that allow such direct sampling. For that, we first define a model of adjacency in the metric space.

Definition 2. Let (S, d, µ) be a metric-probability space. Let α : S → (0, +∞) be a positive measurable function, named the neighborhood radius. The neighborhood model N is defined as the set-valued function that assigns to each x ∈ S the ball N(x) := {y ∈ S : d(x, y) ≤ max{α(x), α(y)}}.

Since y ∈ N(x) implies x ∈ N(y) for all x, y ∈ S, Def. 2 models only symmetric graphs. Next, we define a class of continuous Laplacians based on neighborhood models.

Definition 3. Let (S, d, µ) be a metric-probability space, and N a neighborhood model as in Def. 2. Let m^{(i)} : ℝ → ℝ be a continuous function for every i ∈ {1, . . . , 4}. The metric-probability Laplacian model is the kernel operator $\mathcal{L}_N$ that operates on signals u : S → ℝ by

$$(\mathcal{L}_N u)(x) := \int_{N(x)} m^{(1)}\big(\mu(N(x))\big)\, m^{(2)}\big(\mu(N(y))\big)\, u(y)\, d\mu(y) - \left( \int_{N(x)} m^{(3)}\big(\mu(N(x))\big)\, m^{(4)}\big(\mu(N(y))\big)\, d\mu(y) \right) u(x). \tag{3}$$

To give a concrete example, suppose the neighborhood radius α(x) = α is constant, m^{(1)}(x) = m^{(3)}(x) = x^{-1}, and m^{(2)}(x) = m^{(4)}(x) = 1. Then (3) gives

$$(\mathcal{L}_N u)(x) = \frac{1}{\mu(B_\alpha(x))} \int_{B_\alpha(x)} u(y)\, d\mu(y) - u(x),$$

which is an approximation of the Laplace-Beltrami operator. Since the neighborhood model of S represents adjacency in the metric space, we make the modeling assumption that graphs are sampled from neighborhood models, as follows. First, random independent points x = {x_i}_{i=1}^N are sampled from S according to the "non-uniform" distribution ν as before.
Then, an edge is created between each pair x_i and x_j if x_j ∈ N(x_i), to form the graph G. Now, a GSO can be sampled from a metric-probability Laplacian model $\mathcal{L}_N$ by (2), if the underlying continuous model is known. However, such knowledge is not required, since the special structure of the metric-probability Laplacian model allows deriving the GSO directly from the sampled graph G and the sampled density {ρ(x_i)}_{i=1}^N. Def. 4 below gives such a construction of a GSO. In the following, given a vector u ∈ ℝ^N and a function m : ℝ → ℝ, we denote by m(u) the vector {m(u_i)}_{i=1}^N, and by diag(u) ∈ ℝ^{N×N} the diagonal matrix with diagonal u.

Definition 4. Let G = (V, E) be a graph with adjacency matrix A; let ρ : V → (0, ∞) be a graph signal, referred to as the graph density. The non-uniform geometric GSO is defined to be

$$L_{G,\rho} := N^{-1} D_\rho^{(1)} A_\rho D_\rho^{(2)} - N^{-1} \operatorname{diag}\!\big( D_\rho^{(3)} A_\rho D_\rho^{(4)} \mathbf{1} \big), \tag{4}$$

where $A_\rho = A \operatorname{diag}(\rho)^{-1}$ and $D_\rho^{(i)} = \operatorname{diag}\!\big( m^{(i)}(N^{-1} A_\rho \mathbf{1}) \big)$.

Def. 4 retrieves, as particular cases, the usual GSOs, as shown in Tab. 3 in Appendix C. For example, in the case of m^{(1)}(x) = m^{(3)}(x) = x^{-1}, m^{(2)}(x) = m^{(4)}(x) = 1, and uniform sampling ρ = 1, (4) leads to the random-walk Laplacian L_{G,1} = D^{-1}A − I. The non-uniform geometric GSO in Def. 4 is the Monte-Carlo approximation of the metric-probability Laplacian in Def. 3. This is shown in the following proposition, whose proof can be found in Appendix D.

Proposition 1. Let G = (V, E) be a random graph with an i.i.d. sample x = {x_i}_{i=1}^N from the metric-probability space (S, d, µ) with neighborhood structure N. Let L_{G,ρ} be the non-uniform geometric GSO as in Def. 4. Let u ∈ L^∞(S) and u = {u(x_i)}_{i=1}^N. Then, for every i = 1, . . . , N,

$$\mathbb{E}\Big[ \big( (L_{G,\rho}\, \mathbf{u})_i - (\mathcal{L}_N u)(x_i) \big)^2 \Big] = O(N^{-1}).$$

In Appendix D we also show that, with probability at least 1 − p, it holds that

$$\forall\, i \in \{1, \ldots, N\}, \qquad \big| (L_{G,\rho}\, \mathbf{u})_i - (\mathcal{L}_N u)(x_i) \big| = O\Big( N^{-1/2} \sqrt{\log(1/p) + \log(N)} \Big).$$

Prop. 1 means that if we are given a graph that was sampled from a neighborhood model, and we know (or have an estimate of) the sampling density at every node of the graph, then we can compute a GSO according to (4) that is guaranteed to approximate a corresponding unknown metric-probability Laplacian. The next goal is hence to estimate the sampling density from a given graph.
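Def. 4 amounts to a handful of matrix operations. Below is a minimal dense-matrix sketch (our naming; a real implementation would likely use sparse matrices), together with the sanity check that uniform ρ = 1 and m^{(1)} = m^{(3)} = x⁻¹, m^{(2)} = m^{(4)} = 1 recover the random-walk Laplacian, as stated above.

```python
import numpy as np

def nug_gso(A, rho, m1, m2, m3, m4):
    """Non-uniform geometric GSO of Def. 4 (dense sketch):
    L = N^{-1} D1 A_rho D2 - N^{-1} diag(D3 A_rho D4 1),
    with A_rho = A diag(rho)^{-1} and Di = diag(m_i(N^{-1} A_rho 1))."""
    N = A.shape[0]
    A_rho = A / rho[None, :]            # right-multiplication by diag(rho)^{-1}
    d = A_rho @ np.ones(N) / N          # N^{-1} A_rho 1
    D1, D2, D3, D4 = (np.diag(m(d)) for m in (m1, m2, m3, m4))
    return (D1 @ A_rho @ D2 - np.diag(D3 @ A_rho @ D4 @ np.ones(N))) / N

# Sanity check: m1 = m3 = x^{-1}, m2 = m4 = 1, and uniform rho = 1
# must give the random-walk Laplacian D^{-1} A - I.
rng = np.random.default_rng(0)
A = np.triu((rng.random((8, 8)) < 0.5).astype(float), 1)
A = A + A.T                                               # symmetric, no self-loops
A = np.minimum(A + np.roll(np.eye(8), 1, axis=1)
                 + np.roll(np.eye(8), -1, axis=1), 1.0)   # add a cycle: no isolated nodes
L = nug_gso(A, np.ones(8),
            lambda x: 1 / x, lambda x: np.ones_like(x),
            lambda x: 1 / x, lambda x: np.ones_like(x))
L_rw = np.diag(1 / A.sum(axis=1)) @ A - np.eye(8)
```

Other rows of Tab. 3 would correspond to other choices of the four functions m^{(i)}.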

2.3. INFERRING THE SAMPLING DENSITY

In real-world scenarios, the true value of the sampling density is not known. The following result gives a first rough estimate of the sampling density in a special case.

Lemma 1. Let (S, d, µ) be a metric-probability space; let N be a neighborhood model; let ν be a weighted measure with respect to µ with continuous density ρ bounded away from zero and infinity. Then there exists a function c : S → S such that c(x) ∈ N(x) and

$$(\rho \circ c)(x) = \frac{\nu(N(x))}{\mu(N(x))}.$$

The proof can be found in Appendix D. In light of Lemma 1, if the neighborhood radius of x is small enough, if the volumes µ(N(x)) are approximately constant, and if ρ does not vary too fast, then the sampling density at x is roughly proportional to ν(N(x)), that is, to the probability that a point is drawn from N(x). Therefore, in this situation, the sampling density ρ(x) can be approximated by the degree of the node x. In practice, we are interested in graphs where the volumes of the neighborhoods µ(N(x)) are not constant. Still, a normalization of the GSO by the degree can soften the distortion introduced by non-uniform sampling, at least locally in areas where µ(N(x)) is slowly varying. This suggests that the degree of a node is a good input feature for a method that learns the sampling density from the graph structure and the node features. Such a method is developed next.
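The heuristic behind Lemma 1 is easy to test numerically: on the unit circle with a constant radius α, µ(N(x)) = 2α is constant, so the normalized degree deg_i/(2αN) should track ρ(x_i). A sketch under these assumptions (the density, the parameters, and all names are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 2000, 0.05
rho = lambda y: 1.0 + 0.5 * np.sin(2 * np.pi * y)  # density on the circle [0, 1)

# Rejection-sample N points from rho (envelope constant 1.5 >= max rho).
x = np.empty(0)
while x.size < N:
    prop = rng.random(2 * N)
    x = np.concatenate([x, prop[1.5 * rng.random(2 * N) < rho(prop)]])
x = x[:N]

# Constant-radius geometric graph on the circle, so mu(N(x)) = 2 * alpha everywhere.
diff = np.abs(x[:, None] - x[None, :])
A = (np.minimum(diff, 1.0 - diff) <= alpha).astype(float)
np.fill_diagonal(A, 0.0)

# Lemma 1 heuristic: rho(x_i) ≈ nu(N(x_i)) / mu(N(x_i)) ≈ deg_i / (2 * alpha * N).
rho_hat = A.sum(axis=1) / (2 * alpha * N)
rel_err = np.mean(np.abs(rho_hat - rho(x)) / rho(x))
corr = np.corrcoef(rho_hat, rho(x))[0, 1]
```

When µ(N(x)) varies, as in the graphs of interest, this estimator confounds density and radius, which is exactly why the learned estimator of the next sections is needed.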

2.4. GEOMETRIC GRAPHS WITH HUBS

When designing a method to estimate the sampling density from the graph, the degree is not a sufficient input parameter. The reason is that the degree of a node has two main contributions: the sampling density and the neighborhood radius. The problem of decoupling the two contributions is difficult in the general case. However, if the sampling density is slowly varying and the neighborhood radius is piecewise constant, the problem becomes easier. Intuitively, a slowly varying sampling density causes a slight change in the degree of adjacent nodes. In contrast, a sudden change in the degree is caused by a radius jump. In time-frequency analysis and compressed sensing, various results guarantee the ability to separate a signal into its different components, e.g., piecewise constant and smooth components (Do et al., 2022; Donoho & Kutyniok, 2013; Gribonval & Bacry, 2003). This motivates our model of geometric graphs with hubs.

Definition 5. A geometric graph with hubs is a random graph with non-uniform geometric GSO, sampled from a metric-probability space (S, d, µ) with neighborhood model N, where the sampling density ρ is Lipschitz continuous in S and µ(N(x)) is piecewise constant.

We call this model a geometric graph with hubs since we typically assume that µ(N(x)) has a low value for most points x ∈ S, while only a few small regions, called hubs, have large neighborhoods. In Section 3.1, we show that geometric graphs with hubs can model real-world graphs. To validate this, we train a graph auto-encoder on real-world networks, where the decoder is restricted to be a geometric graph with hubs. The fact that such a decoder can achieve low error rates suggests that real-world graphs can often be modeled as geometric graphs with hubs. Geometric graphs with hubs are also reasonable from a modeling point of view. For example, it is reasonable to assume that different demographics join a social media platform at different rates.
Since the demographic is directly related to the node features, and the graph roughly exhibits homophily, the features are slowly varying over the graph, and hence, so is the sampling density. On the other hand, hubs in social networks are associated with influencers. The conditions that make a certain user an influencer are not directly related to the features. Indeed, if the node features in a social network are user interests, users that follow an influencer tend to share their features with the influencer, so the features themselves are not enough to determine if a node is deemed to be a center of a hub or not. Hence, the radius does not tend to be continuous over the graph, and, instead, is roughly constant and small over most of the graph (non-influencers), except for some narrow and sharp peaks (influencers).

2.5. LEARNING THE SAMPLING DENSITY

In the current section, we propose a strategy to estimate the sampling density ρ. As suggested by the above discussion, the local changes in the degree of the graph give us a lot of information about the local changes in the sampling density and the neighborhood radius of geometric graphs with hubs. Hence, we implement the density estimator as a message-passing graph neural network (MPNN) Θ, because it performs local computations and is equivariant to node indexing, a property that both the density and the degree satisfy. Since we are mainly interested in estimating the inverse of the sampling density, Θ takes as input two channels for every node in the graph: the inverse of the degree and the inverse of the mean degree of the one-hop neighborhood. However, it is not yet clear how to train Θ. Since in real-world scenarios the ground-truth density is not known, we train Θ in a self-supervised manner. In this context, we choose a task (link prediction, node or graph classification, etc.) on a real-world graph G and solve it by means of a graph neural network Ψ, referred to as the task network. Since we want Ψ to depend on the sampling density estimator Θ, we define Ψ as a spectral graph convolution network based on the non-uniform geometric GSO L_{G,Θ(G)}, e.g., GCN (Kipf & Welling, 2017), ChebNet (Defferrard et al., 2016), or CayleyNet (Levie et al., 2019). We then train Ψ end-to-end on the given task. The idea behind the proposed method is that the task depends mostly on the underlying continuous model. For example, in shape classification, the label of each graph depends on the surface from which the graph is sampled, rather than on the specific intricate structure of the discretization. Therefore, the task network Ψ can perform well if it learns to ignore the particular fine details of the discretization and to focus on the underlying space. The correction of the GSO via the estimated sampling density (Eq. (4)) gives the network exactly such power.
Therefore, we conjecture that Θ will indeed learn how to estimate the sampling density for graphs that exhibit homophily. To verify this claim and validate our model, we focus on link prediction on synthetic datasets (see Appendix B), for which the ground-truth sampling density is known. As shown in Fig. 1, the MPNN Θ is able to correctly identify hubs and correctly predict the ground-truth density in a self-supervised manner.
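The whole pipeline, density estimator Θ → corrected GSO → spectral task network Ψ, can be summarized as a single differentiable forward pass. The snippet below is a forward-only NumPy toy: a two-layer MLP on degree features stands in for the MPNN Θ, and a density-corrected random-walk Laplacian (the m^{(1)} = m^{(3)} = x⁻¹, m^{(2)} = m^{(4)} = 1 case of Def. 4) serves as the GSO. Weights are random and untrained; in practice both networks would be trained end-to-end with an autodiff framework, and all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def theta(A, W1, W2):
    """Toy stand-in for the density estimator Theta: a two-layer MLP on the
    inverse degree and the inverse mean neighbor degree (the two input channels
    described above), followed by the mean normalization of Appendix A.2."""
    deg = A.sum(axis=1)
    feats = np.stack([1 / deg, deg / (A @ deg)], axis=1)       # (N, 2)
    inv_rho = np.abs(np.maximum(feats @ W1, 0) @ W2)[:, 0] + 1e-6
    inv_rho /= inv_rho.mean()     # enforce mean(1/rho) ≈ 1 (Monte-Carlo identity)
    return 1 / inv_rho

def gso(A, rho):
    """Density-corrected random-walk Laplacian, a special case of Def. 4."""
    A_rho = A / rho[None, :]
    return A_rho / A_rho.sum(axis=1, keepdims=True) - np.eye(A.shape[0])

def psi(L, X, coeffs, W):
    """Task network: order-1 polynomial spectral filter plus a linear readout."""
    return (coeffs[0] * X + coeffs[1] * (L @ X)) @ W

# Forward pass on a toy graph (a cycle plus random chords), 3 in / 2 out channels.
N = 10
A = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
A = np.minimum(A + A.T + np.roll(np.eye(N), 1, axis=1)
                       + np.roll(np.eye(N), -1, axis=1), 1.0)
X = rng.standard_normal((N, 3))
rho_hat = theta(A, rng.standard_normal((2, 8)), rng.standard_normal((8, 1)))
Y = psi(gso(A, rho_hat), X, [0.5, 0.5], rng.standard_normal((3, 2)))
```

Because the task loss is a function of Y, gradients flow through the GSO into Θ, which is what allows the density to be learned without direct supervision.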

3. EXPERIMENTS

In the following, we validate the NuG model experimentally. Moreover, we verify the validity of our method first on synthetic datasets, then on real-world graphs in a transductive (node classification) and an inductive (graph classification) setting. Finally, we propose proof-of-concept applications in explainability, learning GSOs, and differentiable pooling.

[Figure caption: performance averaged across 10 runs for each value of the latent dimension.]

3.1. LINK PREDICTION

The method proposed in Section 2.5 is applied to synthetic datasets of geometric graphs with hubs (see Appendices A.1 and A.2 for details). Fig. 1 shows that Θ is able to correctly predict the value of the sampling density. The left plots of Figs. 1a and 1b show that the density is well approximated both at hubs and at non-hubs. Looking at the right plots, it is evident that the density cannot be predicted solely from the degree. Fig. 2 and Fig. 7 in Appendix A.3 show that the NuG model is able to effectively represent real-world graphs, outperforming other graph auto-encoder methods (see Tab. 1 for the number of parameters of each method). Here, we learn an auto-encoder with four types of decoders: inner product, MLP, constant neighborhood radius, and piecewise constant neighborhood radius corresponding to a geometric graph with hubs (see Appendix A.3 for more details). Better performance is reached if the graph is allowed to be a geometric graph with hubs as in Def. 5. Moreover, the performance of the distance and distance+hubs decoders is consistent across datasets, unlike that of the inner-product and MLP decoders. This corroborates the claim that real-world graphs can be better modeled as geometric graphs with non-constant neighborhood radius. Fig. 8 in Appendix A.3 shows the learned probabilities of being a hub, and the learned values of α and β, for the Pubmed graph.

3.2. NODE CLASSIFICATION

Another exemplary application is to use a non-uniform geometric GSO L_{G,ρ} (Def. 4) in a spectral graph convolution network for node classification tasks, where the density ρ_i at each node i is computed by a separate graph neural network, and the whole model is trained end-to-end on the task. The details are reported in Appendix A.2. In Fig. 3 we show the accuracy of the best-scoring GSO out of the ones reported in Tab. 3 when the density is ignored against the best-scoring GSO when the sampling density is learned. For Citeseer and FacebookPagePage, the best GSO is the symmetric normalized adjacency matrix. For Cora and Pubmed, the best density-ignored GSO is the symmetric normalized adjacency matrix, while the best density-normalized GSO is the adjacency matrix. For AmazonComputers and AmazonPhoto, the best-scoring GSO is the symmetric normalized Laplacian. This validates our analysis: if the sampling density is ignored, the best choice is to normalize the Laplacian by the degree to soften the distortion of non-uniform sampling.

[Figure caption: comparison when the importance ρ⁻¹ is ignored (I), used to correct the Laplacian (L), or used for pooling (P). Each point represents the performance of one run. In (a) the best performance is reached when ρ⁻¹ is used to correct the Laplacian, and in (b) when ρ⁻¹ is used for pooling.]

3.3. GRAPH CLASSIFICATION & DIFFERENTIABLE POOLING

In this experiment, we perform graph classification on the AIDS dataset (Riesen & Bunke, 2008), as explained in Appendix A.2. Fig. 4 shows that the classification performance of a spectral graph neural network is better if a share of the parameters is used to learn ρ, which is then used in a non-uniform geometric GSO (Def. 4). The learnable ρ on the AIDS dataset can be used not only to correct the Laplacian but also to perform better pooling (see Appendix A.2 for the details). Usually, a graph convolutional neural network is followed by a global pooling layer in order to extract a representation of the whole graph. A vanilla pooling layer aggregates the contributions of all nodes uniformly. We implemented a weighted pooling layer that takes into account the importance of each node. As shown in Fig. 4, the weighted pooling layer can indeed improve performance on the graph classification task. Fig. 6 in Appendix A.2 shows a comparison between the degree, the density learned to correct the GSO, and the density learned for pooling. From the plot it is clear that the degree cannot predict the density. Indeed, the sampling density at nodes with the same degree can take different values.

[Figure caption: (Left) The frequency of each chemical element, computed as the number of compounds labeled as active (inactive) containing that particular element divided by the number of active (inactive) compounds. This is a measure of rarity. For example, potassium is present in 5 out of 400 active compounds, and in 1 out of 1600 inactive compounds. Hence, it is rarer to find potassium in an inactive compound. (Right) The mean importance of each element when ρ⁻¹ is used to correct the GSO (L, orange) and when it is used for weighted pooling (P, green). Carbon, oxygen, and nitrogen have low mean importance, which makes sense as they are present in almost every compound, as shown in the left plot. The chemical elements are sorted according to their mean importance when ρ⁻¹ is used to correct the GSO (orange bars).]
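The difference between a vanilla and a weighted global pooling layer is essentially a one-line change. A minimal sketch (our naming), using an importance-weighted sum:

```python
import numpy as np

def global_add_pool(H):
    """Vanilla global pooling: every node contributes equally."""
    return H.sum(axis=0)

def weighted_pool(H, inv_rho):
    """Importance-weighted pooling: sum_j rho_j^{-1} H_j."""
    return inv_rho @ H

# Node embeddings of a 3-node graph with 2 output channels.
H = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# With uniform importance the weighted layer reduces to the vanilla one;
# a non-uniform importance re-weights each node's contribution.
uniform = weighted_pool(H, np.ones(3))
skewed = weighted_pool(H, np.array([2.0, 0.0, 0.0]))
```

Since the weights ρ⁻¹ are produced by a differentiable network, the pooling layer can be trained end-to-end together with the rest of the model.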

3.4. EXPLAINABILITY IN GRAPH CLASSIFICATION

In this experiment, we show how to use the density estimator for explainability. The inverse density vector ρ⁻¹ can be interpreted as a measure of the importance of each node, relative to the task at hand, instead of as a sampling density. Thinking about ρ⁻¹ as importance is useful when the graph is not naturally seen as randomly generated from a graphon model. We applied this paradigm to the AIDS dataset, as explained in the previous subsection. The better classification performance when ρ is learned demonstrates that ρ is an important feature for the classification task, and hence it can be exploited to extract knowledge from the graph. We define the mean importance of each chemical element e as the sum of all values of ρ⁻¹ corresponding to nodes labeled e, divided by the number of nodes labeled e. Fig. 5 shows the mean importance of each element when ρ⁻¹ is estimated by using it as a module in the task network in two ways. (1) The importance ρ⁻¹ is used to correct the GSO. (2) The importance ρ⁻¹ is used in a pooling layer that maps the output of the graph neural network Ψ to one feature of the form

$$\sum_{j=1}^{|V|} \rho_j^{-1}\, \Psi(X)_j,$$

where X denotes the node features. In both cases, the most important elements are the same; therefore, the two methods seem to be consistent.
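The mean-importance statistic is simply a per-label average of the entries of ρ⁻¹. A minimal sketch (our naming; the labels and values are hypothetical stand-ins for chemical elements):

```python
import numpy as np

def mean_importance(inv_rho, labels):
    """Mean of rho^{-1} over all nodes carrying each label (here: elements)."""
    return {e: float(inv_rho[labels == e].mean()) for e in np.unique(labels)}

# Hypothetical importances for four atoms of a compound.
inv_rho = np.array([0.2, 0.4, 1.0, 3.0])
labels = np.array(["C", "C", "O", "K"])
scores = mean_importance(inv_rho, labels)   # rarer elements get higher mean importance
```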

CONCLUSIONS

In this paper, we addressed the problem of learning the latent sampling density by which graphs are sampled from their underlying continuous models. We developed formulas for representing graphs given their connectivity structure and sampling density using non-uniform geometric GSOs. We then showcased how the density of geometric graphs with hubs can be estimated using self-supervision, and validated our approach experimentally. Last, we showed how knowing the sampling density can help with various tasks, e.g., improving spectral methods, improving pooling, and gaining knowledge from graphs. One limitation of our methodology is the difficulty in validating that real-world graphs are indeed sampled from latent geometric spaces. While we reported experiments that support this modeling assumption, an important future direction is to develop further experiments and tools to support our model. For instance, can we learn a density estimator on one class of graphs and transfer it to another? Can we use ground-truth demographic data to validate the estimated density in social networks? We believe future research will shed light on those questions and find new ways to exploit the sampling density for various applications.

A IMPLEMENTATION DETAILS

A.1 SYNTHETIC DATASET GENERATION

This section explains how to generate a synthetic dataset of geometric graphs with hubs. We first consider a metric space. For our experiments, we mainly focused on the unit circle S¹ and on the unit disk D (see Appendix B for more details). Each graph is generated as follows. First, a non-uniform distribution is randomly generated. We considered an angular non-uniformity as described in Def. 6, where the number of oscillating terms, as well as the parameters c, n, µ, are chosen randomly. In the case of 2-dimensional spaces, the radial distribution is the one shown in Tab. 2. According to each generated probability density function, N points {x_i}_{i=1}^N are drawn independently. Among them, m < N are chosen randomly to be hubs, and any other node whose distance from a hub is less than some ε > 0 is also marked as a hub. We consider two parameters α, β > 0. The neighborhood radius about non-hub (respectively, hub) nodes is taken to be α (respectively, α + β). Any two points are then connected if

$$d(x_i, x_j) \le \max\{r(x_i), r(x_j)\}, \qquad r(x) = \begin{cases} \alpha & x \text{ is a non-hub} \\ \alpha + \beta & x \text{ is a hub.} \end{cases}$$

In practical terms, α is computed such that the resulting graph is connected, and hence it differs from graph to graph; β is set to 3α and ε to α/10.
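The generation procedure above can be sketched on the unit circle (angular coordinate in [0, 1), wrap-around metric). The fixed oscillating density, the default parameters, and all names below are our illustrative choices, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graph_with_hubs(N=500, m=3, alpha=0.02, beta=None, eps=None):
    """Geometric graph with hubs on the unit circle (coordinate in [0, 1))."""
    beta = 3 * alpha if beta is None else beta      # beta = 3 * alpha, as above
    eps = alpha / 10 if eps is None else eps        # eps = alpha / 10, as above
    # 1) Non-uniform sampling (one fixed oscillating density, for illustration),
    #    drawn by rejection sampling with envelope constant 1.5 >= max rho.
    rho = lambda y: 1.0 + 0.5 * np.sin(2 * np.pi * y)
    x = np.empty(0)
    while x.size < N:
        prop = rng.random(2 * N)
        x = np.concatenate([x, prop[1.5 * rng.random(2 * N) < rho(prop)]])
    x = x[:N]
    # 2) Hubs: m random seeds, plus every node within distance eps of a seed.
    diff = np.abs(x[:, None] - x[None, :])
    d = np.minimum(diff, 1.0 - diff)                # wrap-around distances
    seeds = rng.choice(N, size=m, replace=False)
    is_hub = (d[:, seeds] <= eps).any(axis=1)
    # 3) Radii r(x) = alpha (non-hub) or alpha + beta (hub); connect i, j
    #    iff d(x_i, x_j) <= max(r(x_i), r(x_j)).
    r = np.where(is_hub, alpha + beta, alpha)
    A = (d <= np.maximum(r[:, None], r[None, :])).astype(float)
    np.fill_diagonal(A, 0.0)
    return x, A, is_hub

x, A, is_hub = sample_graph_with_hubs()
```

By construction, hubs end up with a markedly larger degree than non-hubs, which is the signal the density estimator must learn to decouple from the density itself.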

A.2 DENSITY ESTIMATION WITH SELF-SUPERVISION

Density Estimation Network In our experiments, the inverse of the sampling density, 1/ρ, is learned by means of an EdgeConv neural network Θ (Wang et al., 2019), referred to as PNet in the following, where the message function is a multi-layer perceptron (MLP) and the aggregation function is max(·), followed by an abs(·) non-linearity. The number of hidden layers, hidden channels, and output channels is 3, 32, and 1, respectively. Since the degree is an approximation of the sampling density, as stated in Lemma 1, and since we are interested in computing its inverse to correct the GSO, the input of PNet is the inverse of the degree and the inverse of the mean degree of the one-hop neighborhood. Justified by the Monte-Carlo approximation 1 = ∫_S dµ(y) = ∫_S ρ(y)^{-1} dν(y) ≈ N^{-1} Σ_{i=1}^N ρ(x_i)^{-1}, with x_i ∼ ρ for all i = 1, …, N, the output of PNet is normalized by its mean.
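The input features and the mean normalization can be sketched as follows; the helper names are ours, and the actual PNet is an EdgeConv network consuming these features rather than a plain function of the adjacency matrix.

```python
import numpy as np

def pnet_features(A):
    """Input signal of the density network: inverse degree and inverse
    mean degree of the one-hop neighborhood (hypothetical helper; the
    paper feeds these features to an EdgeConv network, PNet)."""
    d = A.sum(axis=1)
    d_safe = np.maximum(d, 1.0)            # guard against isolated nodes
    mean_nbr_deg = (A @ d) / d_safe        # mean degree of the neighbors
    return np.stack([1.0 / d_safe, 1.0 / np.maximum(mean_nbr_deg, 1.0)],
                    axis=1)

def normalize_by_mean(rho_inv):
    """Enforce the Monte-Carlo constraint N^{-1} sum_i rho(x_i)^{-1} = 1
    by dividing the estimated 1/rho by its mean."""
    return rho_inv / rho_inv.mean()
```

After this normalization the estimated values of 1/ρ average to one, matching the Monte-Carlo identity above.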

Self-Supervision of PNet via Link Prediction on Synthetic Dataset

To train the PNet Θ, for each graph G, we use Θ(G) to define a GSO L_{G,Θ(G)}. Then, we define a graph auto-encoder, where the encoder is implemented as a spectral graph convolutional network with GSO L_{G,Θ(G)}. The decoder is the usual inner-product decoder. The graph signal is a slice of 20 random columns of the adjacency matrix. The number of hidden channels, hidden layers, and output channels is respectively 32, 2, and 2. For each node j, the network outputs a feature Θ(G)_j in R^n, where R^n is seen as the metric space underlying the NuG. In our experiments (Section 3.1), we choose n = 2. Some results are shown in Fig. 1.

Node Classification Let G be the real-world graph. In Section 3.2, we considered G to be one of the graphs reported in Tab. 1. The task network Ψ is a polynomial convolutional neural network implementing a GSO L_{G,Θ(G)}, where Θ is the PNet; the order of the polynomial spectral filters is 1, the number of hidden channels 32, and the number of hidden layers 2; the GSOs used are the ones in Tab. 3. The optimizer is ADAM (Kingma & Ba, 2015) with learning rate 10^{-2}. We split the nodes into training (85%), validation (5%), and test (10%) sets in a stratified fashion, and apply early stopping. The performance of the method is shown in Fig. 3.

Graph Classification Let G be the real-world graph. In Section 3.4, G is any compound in the AIDS dataset. The task network Ψ is a polynomial convolutional neural network implementing a GSO L_{G,Θ(G)}, where Θ is the PNet; the order of the spectral polynomial filters is 1, the number of hidden channels 128, and the number of hidden layers 2. The optimizer is ADAM with learning rate 10^{-2}. We perform a stratified split of the graphs into training (85%), validation (5%), and test (10%) sets, and apply early stopping. The chosen batch size is 64. The pooling layer is a global add layer.
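The self-supervision objective above can be illustrated with a dense sketch of the inner-product decoder loss (our own simplified helper; the paper trains a spectral encoder with the corrected GSO L_{G,Θ(G)} end to end, which is not reproduced here).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_prediction_loss(Z, A):
    """Cross-entropy of the inner-product decoder sigma(<z_i, z_j>)
    against the observed adjacency matrix. Dense, simplified sketch of
    the link-prediction self-supervision signal."""
    P = sigmoid(Z @ Z.T)                      # predicted edge probabilities
    eps = 1e-9                                # numerical guard for log
    return -np.mean(A * np.log(P + eps) + (1 - A) * np.log(1 - P + eps))
```

Minimizing this loss over the encoder (and, through the GSO, over PNet) is what drives the density estimate without any density labels.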
In the case of weighted pooling as in Section 3.3, the task network Ψ implements the GSO L_{G,1}, while Θ is used to output the weights of the pooling layer. The performance metrics of both approaches are shown in Fig. 4.

A.3 GEOMETRIC GRAPHS WITH HUBS AUTO-ENCODER

Here, we validate that real-world graphs can be modeled approximately as geometric graphs with hubs, as claimed in Section 3.1. We consider the datasets listed in Tab. 1. The auto-encoder is defined as follows. Let G be the real-world graph with N nodes and F node features, and let X ∈ R^{N×F} be the feature matrix. Let n be the dimension of the metric space in which nodes are embedded. Let Ψ be a spectral graph convolutional network, referred to as the encoder, and let Ψ(X)_i, Ψ(X)_j ∈ R^n be the embeddings of nodes i and j, respectively. A decoder is a mapping R^n × R^n → [0, 1] that takes as input the embeddings of two nodes i, j and returns the probability that the edge (i, j) exists. We use four types of decoders. (1) The inner-product decoder from Kipf & Welling (2016) is defined as σ(⟨Ψ(X)_i, Ψ(X)_j⟩), where σ(·) is the logistic sigmoid function. (2) The MLP decoder is defined as σ(MLP([Ψ(X)_i, Ψ(X)_j])), where [Ψ(X)_i, Ψ(X)_j] ∈ R^{2n} denotes the concatenation of Ψ(X)_i and Ψ(X)_j, and MLP denotes a multi-layer perceptron. (3) The distance decoder corresponds to geometric graphs. It is defined as σ(α − ∥Ψ(X)_i − Ψ(X)_j∥_2), where α is the trainable neighborhood radius. (4) The distance+hubs decoder corresponds to geometric graphs with hubs. It is defined as σ(α + max{Υ(D̃)_i, Υ(D̃)_j} β − ∥Ψ(X)_i − Ψ(X)_j∥_2), where α, β are trainable parameters that describe the radii of non-hubs and hubs. Υ is a message-passing graph neural network (with the same architecture as PNet) that takes as input a signal D̃ computed from the node degrees (i.e., the inverse of the degree and the inverse of the mean degree of the one-hop neighborhood), and outputs the probability that each node is a hub.
Υ is learned end-to-end together with the rest of the auto-encoder. In order to guarantee that 0 ≤ Υ(G)_j ≤ 1, the network is followed by a min-max normalization. The distance decoder is justified by the fact that the condition ∥Ψ(X)_i − Ψ(X)_j∥_2 ≤ α can be rewritten as H(α − ∥Ψ(X)_i − Ψ(X)_j∥_2), where H(·) is the Heaviside function; the Heaviside function is relaxed to the logistic sigmoid for differentiability. Similar reasoning lies behind the formula of the distance+hubs decoder. The encoder Ψ is a polynomial spectral graph convolutional neural network implementing as GSO the symmetric normalized adjacency matrix; the order of the polynomial filters is 1, the number of hidden channels 32, and the number of hidden layers 2. In the case of the inner-product, MLP, and distance decoders, the loss is the cross entropy of existing and non-existing edges. In the case of the distance+hubs decoder, we also add ∥Υ(G)∥_1 / N to the loss as a regularization term, since in our model we suppose the number of hubs is low. The optimizer is ADAM with learning rate 10^{-2}. We split the edges into training (85%), validation (5%), and test (10%) sets, and apply early stopping.

Table 1: Real-world networks used for the link prediction task: graph statistics and number of parameters of the auto-encoder for each of the four decoder types: inner product, distance, distance+hubs, and MLP. Since the number of input channels of the MLP decoder depends on the latent dimension n, we report the number of parameters for n = 3. The distance decoder has one more learnable parameter than the inner-product decoder. Since PNet has a fixed number of input channels, the distance+hubs decoder has 2,535 more learnable parameters than the inner-product one. On the contrary, the MLP decoder has a number of input channels that depends on the latent dimension; therefore, the number of hidden channels is chosen to guarantee that the number of learnable parameters of the MLP decoder is approximately 2,535.
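The distance and distance+hubs decoders above can be sketched directly from their formulas (helper names are ours; the trainable parameters α, β and the hub probabilities p are plain arguments here instead of learned tensors):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distance_decoder(zi, zj, alpha):
    """sigma(alpha - ||z_i - z_j||): relaxed Heaviside of the geometric
    edge rule ||z_i - z_j|| <= alpha."""
    return sigmoid(alpha - np.linalg.norm(zi - zj, axis=-1))

def distance_hubs_decoder(zi, zj, pi, pj, alpha, beta):
    """sigma(alpha + max{p_i, p_j} * beta - ||z_i - z_j||), where p is
    the estimated probability of being a hub: hubs get a radius
    enlarged by beta."""
    return sigmoid(alpha + np.maximum(pi, pj) * beta
                   - np.linalg.norm(zi - zj, axis=-1))
```

Note how a positive hub probability on either endpoint increases the predicted edge probability, mirroring the max{r(x_i), r(x_j)} connection rule of geometric graphs with hubs.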

B SYNTHETIC DATASETS -A BLUEPRINT

In the following, we consider some simple latent metric spaces and construct methods for randomly generating non-uniform samples. For each space, structural properties of the corresponding NuG are studied, such as the expected degree of a node and the expected average degree, in the case where the radius is fixed and the sampling is non-uniform. All proofs can be found in Appendix D, if not otherwise stated. Three natural metric measure spaces are the Euclidean, spherical, and hyperbolic spaces. If we restrict attention to 2-dimensional spaces, a way to sample uniformly is summarized in Tab. 2. In all three cases, the radial component arises naturally from the measure of the space. A possible way to introduce non-uniformity is to change the angular distribution. In this way, preferential directions are identified, leading to an anisotropic model.

Table 2: Properties of Euclidean, spherical, and hyperbolic spaces of dimension 2. In the case of Euclidean and hyperbolic spaces, the uniform distribution refers to a disk of radius R.

Measure of a ball of radius α: Euclidean π α²; spherical 2π (1 − cos(α)); hyperbolic 2π (cosh(α) − 1).

Uniform p.d.f.: Euclidean (2π)^{-1} 1_{[−π,π)}(θ) · 2R^{-2} r 1_{[0,R)}(r); spherical (2π)^{-1} 1_{[−π,π)}(θ) · 2^{-1} sin(φ) 1_{[0,π)}(φ); hyperbolic (2π)^{-1} 1_{[−π,π)}(θ) · (cosh(R) − 1)^{-1} sinh(r) 1_{[0,R)}(r).

Distance in polar coordinates: Euclidean √(r_1² + r_2² − 2 r_1 r_2 cos(θ_1 − θ_2)); spherical arccos(cos(φ_1) cos(φ_2) + sin(φ_1) sin(φ_2) cos(θ_1 − θ_2)); hyperbolic arccosh(cosh(r_1) cosh(r_2) − sinh(r_1) sinh(r_2) cos(θ_1 − θ_2)).

Definition 6. Given a natural number C ∈ N and vectors c ∈ R^C, n ∈ N^C, µ ∈ R^C, the function

sb(θ; c, n, µ) = (1/B) Σ_{i=1}^C c_i cos(n_i (θ − µ_i)) + A/B,   A = Σ_{i=1}^C |c_i|,   B = 2π ( Σ_{i=1}^C |c_i| + Σ_{i: n_i = 0} c_i ),

is a continuous, 2π-periodic probability density function. It will be referred to as spectrally bounded. The cosine can be replaced by a generic 2π-periodic function; the only change in the construction will be the offset and the normalization constant.

Definition 7. Given a natural number C ∈ N and vectors c ∈ R^C, n ∈ N^C, µ ∈ R^C, κ ∈ R^C_{≥0}, the function

mvM(θ; c, n, µ, κ) = (1/B) Σ_{i=1}^C c_i exp(κ_i cos(n_i (θ − µ_i))) / (2π I_0(κ_i)) + A/B,

where

A = Σ_{i: c_i < 0} |c_i| exp(κ_i) / (2π I_0(κ_i)),   B = Σ_{i: n_i ≥ 1} c_i + Σ_{i: n_i = 0} c_i exp(κ_i) / I_0(κ_i) + Σ_{i: c_i < 0} |c_i| exp(κ_i) / I_0(κ_i),

is a continuous, 2π-periodic probability density function. It will be referred to as multimodal von Mises. Both densities introduced previously can be thought of as functions over the unit circle. Hence, the very first space to be studied is S^1 = {x ∈ R² : ∥x∥ = 1} equipped with the geodesic distance. As shown in the next proposition, the geodesic distance can be computed in a fairly easy way.

Proposition 2. Given two points x, y ∈ S^1 corresponding to the angles x, y ∈ [−π, π), their geodesic distance is equal to d(x, y) = π − |π − |x − y||.
The next proposition computes the degree of a node in a non-uniform unit circle graph.

Proposition 3. Given a spectrally bounded probability density function as in Def. 6, the expected degree of a node θ in a unit circle geometric graph with neighborhood radius α is

deg(θ) = (2N/B) [ Σ_{i: n_i ≠ 0} (c_i / n_i) cos(n_i (θ − µ_i)) sin(n_i α) + ( Σ_{i: n_i = 0} c_i + A ) α ],

and the expected average degree of the whole graph is

E[deg(θ)] = (2π N α / B²) [ Σ_{i: n_i ≠ 0} Σ_{j: n_j = n_i} c_i c_j cos(n_i (µ_i − µ_j)) sin(n_i α)/(n_i α) + 2 ( Σ_{i: n_i = 0} c_i + A )² ].

As a direct consequence, in the limit of α going to zero,

lim_{α→0+} P[B_α(θ)] / (2α) = (1/B) [ Σ_{i: n_i ≠ 0} c_i cos(n_i (θ − µ_i)) lim_{α→0+} sin(n_i α)/(n_i α) + ( Σ_{i: n_i = 0} c_i + A ) ] = sb(θ; c, n, µ);

thus, for sufficiently small α, the probability of a ball centered at θ is proportional to the density computed at θ. Moreover, the error can be bounded as

| sb(θ; c, n, µ) − P[B_α(θ)] / (2α) | ≤ (1/(6B)) ( Σ_{i=1}^C n_i² |c_i| ) α²,

which shows that the approximation worsens the more oscillatory terms there are. In the case of the multimodal von Mises distribution, a closed formula for the probability of balls does not exist. The following proposition introduces an approximation based solely on cosine functions.

Proposition 4. A multimodal von Mises probability density function can be approximated by a spectrally bounded one.

The previous result, combined with Prop. 3, gives a way to approximate the expected degree of spatial networks sampled according to a multimodal von Mises angular distribution.
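The closed form of Prop. 3 can be checked numerically against a direct quadrature of the ball probability, N ∫_{θ−α}^{θ+α} sb(t) dt (a sketch with our own helper names):

```python
import numpy as np

def sb(theta, c, n, mu):
    """Spectrally bounded density of Def. 6."""
    theta = np.asarray(theta, dtype=float)
    c, n, mu = (np.asarray(v, dtype=float) for v in (c, n, mu))
    A = np.abs(c).sum()
    B = 2 * np.pi * (np.abs(c).sum() + c[n == 0].sum())
    return ((c * np.cos(n * (theta[..., None] - mu))).sum(-1) + A) / B

def expected_degree(theta, alpha, N, c, n, mu):
    """Closed form of Prop. 3 for the expected degree at angle theta."""
    c, n, mu = (np.asarray(v, dtype=float) for v in (c, n, mu))
    A = np.abs(c).sum()
    B = 2 * np.pi * (np.abs(c).sum() + c[n == 0].sum())
    osc = n != 0
    s = (c[osc] / n[osc] * np.cos(n[osc] * (theta - mu[osc]))
         * np.sin(n[osc] * alpha)).sum()
    return 2 * N / B * (s + (c[~osc].sum() + A) * alpha)
```

For the uniform special case (C = 1, n_1 = 0) the formula reduces to N α / π, the classical expected degree of a uniform circle graph.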
However, the computation is straightforward when n is the constant vector n1, since the product of two von Mises kernels is the kernel of a von Mises p.d.f.:

[exp(κ_1 cos(n(θ − µ_1))) / (2π I_0(κ_1))] · [exp(κ_2 cos(n(θ − µ_2))) / (2π I_0(κ_2))] = exp( √(κ_1² + κ_2² + 2 κ_1 κ_2 cos(n(µ_1 − µ_2))) cos(n(θ − φ)) ) / (4π² I_0(κ_1) I_0(κ_2)),

where φ = n^{-1} arctan( (κ_1 sin(n µ_1) + κ_2 sin(n µ_2)) / (κ_1 cos(n µ_1) + κ_2 cos(n µ_2)) ).

The unit circle model is preparatory to the study of more complex spaces, for instance, the unit disk D = {x ∈ R² : ∥x∥ ≤ 1} equipped with geodesic distance, as in Tab. 2.

Proposition 5. Given a spectrally bounded angular distribution as in Def. 6, the degree of a node (r, θ) in a unit disk geometric graph with neighborhood radius α is deg(r, θ) ≈ 2π α² N sb(θ; c, n, µ), and the average degree of the whole network is

E[deg(r, θ)] ≈ (2π² α² N / B²) [ Σ_{i: n_i ≠ 0} Σ_{j: n_j = n_i} c_i c_j cos(n_i (µ_i − µ_j)) + 2 ( Σ_{i: n_i = 0} c_i + A )² ].

Fig. 9a shows some examples of non-uniform sampling of the unit disk. The last example is the hyperbolic disk of radius R ≫ 1, equipped with geodesic distance as in Tab. 2.

Proposition 6. Given a spectrally bounded angular distribution as in Def. 6, the degree of a node (r, θ) in a hyperbolic geometric graph with neighborhood radius α is approximately proportional to sb(θ; c, n, µ), with the radial factor derived in Appendix D, and the average degree of the whole network is O(N e^{(α−2R)/2}).

The proof can be found in Appendix D. The computed approximation is in line with the findings of Krioukov et al. (2010), where a closed formula for the uniform case is provided when α = R. To the best of our knowledge, this is the first work that considers α ≠ R. Examples of non-uniform sampling of the hyperbolic disk are shown in Fig. 9b.
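The product identity for von Mises kernels with equal frequency can be verified numerically (helper names are ours; np.i0 is NumPy's modified Bessel function I_0, and arctan2 is used instead of arctan to select the correct branch of the phase):

```python
import numpy as np

def vm_kernel(theta, n, mu, kappa):
    """von Mises kernel exp(kappa cos(n (theta - mu))) / (2 pi I_0(kappa))."""
    return np.exp(kappa * np.cos(n * (theta - mu))) / (2 * np.pi * np.i0(kappa))

def vm_product_params(n, mu1, k1, mu2, k2):
    """Concentration and phase of the product of two kernels sharing n."""
    k = np.sqrt(k1**2 + k2**2 + 2 * k1 * k2 * np.cos(n * (mu1 - mu2)))
    phi = np.arctan2(k1 * np.sin(n * mu1) + k2 * np.sin(n * mu2),
                     k1 * np.cos(n * mu1) + k2 * np.cos(n * mu2)) / n
    return k, phi
```

The identity is exact: the two cosine terms combine into a single sinusoid of the same frequency, which is why the case n = n1 admits a closed computation.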

C RETRIEVING AND BUILDING GSOS

In the current section, we first show how to retrieve the usual definitions of graph shift operators from Def. 4, and then how Def. 4 can be used to create novel GSOs. For simplicity, for both goals we suppose uniform sampling ρ = 1; (4) can then be rewritten as

L_{G,1} = N^{-1} diag(m^{(1)}(N^{-1} d)) A diag(m^{(2)}(N^{-1} d)) − N^{-1} diag( diag(m^{(3)}(N^{-1} d)) A diag(m^{(4)}(N^{-1} d)) 1 ),   (7)

where A is the adjacency matrix and d is the degree vector. Tab. 3 exhibits which choice of {m^{(i)}}_{i=1}^4 corresponds to which graph Laplacian. A question that may arise is whether the innermost diag(·) in (7) can be factored out of the outermost one. As shown in the next proposition, this is not possible in general.

GSO: m^{(1)}(x), m^{(2)}(x), m^{(3)}(x), m^{(4)}(x)
Adjacency: 1(x), 1(x), 0(x), 0(x)
Combinatorial Laplacian: 1(x), 1(x), 1(x), 1(x)
Signless Laplacian (Cvetkovic & Simic, 2009): 1(x), 1(x), −1(x), 1(x)
Random walk Laplacian: x^{-1}, 1(x), x^{-1}, 1(x)
Right normalized Laplacian: 1(x), x^{-1}, x^{-1}, 1(x)
Symmetric normalized adjacency (Kipf & Welling, 2017): x^{-1/2}, x^{-1/2}, 0(x), 0(x)
Symmetric normalized Laplacian: x^{-1/2}, x^{-1/2}, x^{-1}, 1(x)
Equation (8): x^{-1/2}, x^{-1/2}, x^{-1/2}, x^{-1/2}

Example 1 (Complete Bipartite Graph). Consider the complete bipartite graph with n nodes in the first part and m ≥ n nodes in the second part. Its adjacency and degree matrices are

A = [ 0_{n×n}, 1_{n×m} ; 1_{m×n}, 0_{m×m} ],   D = [ m I_{n×n}, 0_{n×m} ; 0_{m×n}, n I_{m×m} ].

A simple computation leads to

L = D^{-1/2} A D^{-1/2} − diag(D^{-1/2} A D^{-1/2} 1) = [ −m^{1/2} n^{-1/2} I_{n×n}, (nm)^{-1/2} 1_{n×m} ; (nm)^{-1/2} 1_{m×n}, −n^{1/2} m^{-1/2} I_{m×m} ].

It can be noted that L has a null eigenvalue λ_1 = 0 corresponding to the constant eigenvector 1_{n+m}. The vector v_i = −e_1 + e_i, i ∈ {2, …, n}, is an eigenvector with eigenvalue λ_2 = −(m/n)^{1/2}, whose multiplicity is n − 1. Analogously, v_i = −e_{n+1} + e_{i+1}, i ∈ {n + 1, …, n + m − 1}, is an eigenvector with eigenvalue λ_3 = −(n/m)^{1/2}, whose multiplicity is m − 1. Finally, the vector v_{n+m} = [−(m/n) 1_n^T, 1_m^T]^T is an eigenvector with eigenvalue λ_4 = λ_2 + λ_3. Therefore, the spectral radius of L is |λ_4| = (m + n)/√(mn). In the case of a balanced graph, n = m implies that the spectral radius is 2. In the case of a star graph, n = 1 and |λ_4| = O(√m) as m → ∞; therefore, the asymptotic bound in Prop. 8 is tight.
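Example 1 can be verified numerically by building the GSO of Equation (8) for a complete bipartite graph and checking that its spectral radius equals (m + n)/√(mn) (a sketch with our own helper name):

```python
import numpy as np

def gso_eq8(A):
    """L = D^{-1/2} A D^{-1/2} - diag(D^{-1/2} A D^{-1/2} 1), i.e. the
    GSO of Equation (8), where m^{(i)}(x) = x^{-1/2} for all i."""
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))
    return S - np.diag(S.sum(axis=1))

# Complete bipartite graph K_{n,m} from Example 1.
n, m = 2, 8
A = np.zeros((n + m, n + m))
A[:n, n:] = 1.0
A[n:, :n] = 1.0
eigs = np.linalg.eigvalsh(gso_eq8(A))
spectral_radius = np.max(np.abs(eigs))   # predicted: (m + n) / sqrt(m n)
```

For n = 2, m = 8 the predicted spectral radius is 10/4 = 2.5, strictly larger than the bound 2 that holds for the symmetric normalized Laplacian, illustrating that the two operators differ.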

D PROOFS

Proof of Prop. 1 and concentration of error. Let x = {x_i}_{i=1}^N be an i.i.d. random sample from ρ. Let K and m be the kernel and diagonal parts corresponding to the metric-probability Laplacian L_N. Let L and u be defined by

L_{i,j} = N^{-1} K(x_i, x_j) ρ(x_j)^{-1} − δ_{i,j} m(x_i),   u_i = u(x_i).

Note that the non-uniform geometric GSO L_{G,ρ} based on the graph G, which is randomly sampled from S with neighborhood model N via the sample points x, is exactly equal to L. Conditioned on x_i = x, the expected value is

E[(L u)_i] = N^{-1} Σ_{j=1}^N E[K(x, x_j) ρ(x_j)^{-1} u(x_j)] − m(x) u(x) = L_N u(x).

Since the random variables {x_j}_{j=1}^N are i.i.d. copies of y, the random variables {K(x, x_j) ρ(x_j)^{-1} u(x_j)}_{j=1}^N are also i.i.d.; hence

var[(L u)_i] = var[ N^{-1} Σ_{j=1}^N K(x, x_j) ρ(x_j)^{-1} u(x_j) − m(x) u(x) ] = N^{-1} var[ K(x, y) ρ(y)^{-1} u(y) ] ≤ N^{-1} E[ ( K(x, y) ρ(y)^{-1} u(y) )² ] = N^{-1} ∫_S ( K(x, y) ρ(y)^{-1} u(y) )² ρ(y) dµ(y) ≤ N^{-1} ∥K(x, ·)² ρ(·)^{-1}∥_{L^∞(S)} ∥u∥²_{L²(S)},

which proves (5). Next, we prove the concentration of error result. We know that there exist a, b > 0 such that almost everywhere K(x, x_j) ρ(x_j)^{-1} u(x_j) ∈ [a, b], since K, 1/ρ and u are essentially bounded. By Hoeffding's inequality, for t > 0,

P[ |(L u)_i − L_N u(x)| ≥ t ] ≤ 2 exp( −2 N t² / (b − a)² ).

Setting p/N = 2 exp(−2 N t² / (b − a)²) and solving for t, we obtain that for every node there is an event of probability at least 1 − p/N in which

|(L u)_i − L_N u(x)| ≤ 2^{-1/2} (b − a) N^{-1/2} √(log(2 N p^{-1})).

We then intersect all of these events to obtain an event of probability at least 1 − p that satisfies (6).

Proof of Lemma 1. By hypothesis, there exist m_x, M_x > 0 such that m_x ≤ ρ(y) ≤ M_x for all y ∈ N(x). Therefore,

M_x^{-1} ∫_{N(x)} dν(y) ≤ ∫_{N(x)} dµ(y) = ∫_{N(x)} ρ(y)^{-1} dν(y) ≤ m_x^{-1} ∫_{N(x)} dν(y),

from which m_x ≤ ( ∫_{N(x)} dν(y) ) / ( ∫_{N(x)} dµ(y) ) ≤ M_x. By the Intermediate Value Theorem, there exists c_x ∈ N(x) such that ρ(c_x) = ( ∫_{N(x)} dν(y) ) / ( ∫_{N(x)} dµ(y) ), from which the thesis follows.

Proof of Prop. 2.
Consider the map φ : [−π, π) → S^1, θ ↦ (cos(θ), sin(θ))^T, and the angles x, y ∈ [−π, π) such that φ(x) = x and φ(y) = y. It holds

d(x, y) = arccos(x^T y) = arccos(cos(x) cos(y) + sin(x) sin(y)) = arccos(cos(x − y)),

which, depending on the value of x − y, equals 2π + x − y for x − y ∈ [−2π, −π); y − x for x − y ∈ [−π, 0); x − y for x − y ∈ [0, π); and 2π + y − x for x − y ∈ [π, 2π). In other words, it equals 2π − |x − y| when |x − y| > π and |x − y| when |x − y| ≤ π, that is, π − |π − |x − y||.

Proof of Prop. 3. The expected degree of a node θ is the probability of the ball centered at θ times the size N of the sample. The probability of a ball can be computed by noting that

∫_{θ_c−α}^{θ_c+α} cos(n_i (θ − µ_i)) dθ = 2α if n_i = 0, and 2 cos(n_i (θ_c − µ_i)) sin(n_i α) / n_i otherwise.

Therefore, the average degree can be computed as

d̄ = N ∫_{−π}^{π} P[B_α(θ)] sb(θ; c, n, µ) dθ.

Inspection of sb(θ; c, n, µ) and P[B_α(θ)] shows that the only terms surviving integration are the constant term and the products of cosines with the same frequency,

∫_{−π}^{π} cos(n_i (θ − µ_i)) cos(n_j (θ − µ_j)) dθ = π cos(n_i (µ_j − µ_i)) if n_i = n_j ≠ 0, and 0 if n_i ≠ n_j,

from which the thesis follows.

Proof of Prop. 4. Using the Taylor expansion, it holds

exp(κ_i cos(n_i (θ − µ_i))) = Σ_{m=0}^∞ (κ_i^m / m!) cos(n_i (θ − µ_i))^m = 1 + Σ_{m=1}^∞ (κ_i^{2m} / (2m)!) cos(n_i (θ − µ_i))^{2m} + Σ_{m=1}^∞ (κ_i^{2m−1} / (2m−1)!) cos(n_i (θ − µ_i))^{2m−1}.

A first approximation can be made by noting that cos(x)^{2m} ≤ cos(x)² and cos(x)^{2m−1} ≈ cos(x) for all m ≥ 1, obtaining

exp(κ_i cos(n_i (θ − µ_i))) ≈ 1 + (cosh(κ_i) − 1) cos(n_i (θ − µ_i))² + sinh(κ_i) cos(n_i (θ − µ_i)).

Such an approximation deteriorates fast when κ_i increases. A more refined approximation is obtained by considering the power of the cosine with the highest coefficient in the Taylor expansion. Using Stirling's approximation of the factorial, it can be shown that κ_i^m / m! ≈ (2π m)^{-1/2} (κ_i e / m)^m.
In order to make the computation easier, suppose κ_i is an integer. When m = κ_i + 1, it holds

(2π(κ_i + 1))^{-1/2} (κ_i e / (κ_i + 1))^{κ_i+1} = (e^{κ_i+1} / √(2π(κ_i + 1))) (κ_i / (κ_i + 1))^{κ_i+1} < e^{κ_i} / √(2π(κ_i + 1)) < e^{κ_i} / √(2π κ_i),

where the first inequality is justified by the fact that (κ_i/(κ_i + 1))^{κ_i+1} is an increasing sequence that tends to 1/e. The previous formula shows that the coefficient with m = κ_i + 1 is always smaller than the coefficient with m = κ_i. The same reasoning can be applied to all the coefficients with m > κ_i. Suppose now κ_i ≥ 3; if m ≤ κ_i − 2, the previous reasoning holds. A peculiarity happens when m = κ_i − 1:

(2π(κ_i − 1))^{-1/2} (κ_i e / (κ_i − 1))^{κ_i−1} = (e^{κ_i−1} / √(2π κ_i)) (κ_i / (κ_i − 1))^{κ_i−1/2} > e^{κ_i} / √(2π κ_i),

because the sequence (κ_i/(κ_i − 1))^{κ_i−1/2} is decreasing; therefore m = κ_i − 1 is the point of maximum, and m = κ_i gives the second largest value. Therefore, the following approximation for exp(κ_i cos(n_i(θ − µ_i))) holds: for κ_i ≤ 1,

1 + (cosh(κ_i) − 1) cos(n_i(θ − µ_i))² + sinh(κ_i) cos(n_i(θ − µ_i));

for κ_i ≥ 1 even,

1 + (cosh(κ_i) − 1) cos(n_i(θ − µ_i))^{κ_i} + sinh(κ_i) cos(n_i(θ − µ_i))^{κ_i−1};

for κ_i ≥ 1 odd,

1 + (cosh(κ_i) − 1) cos(n_i(θ − µ_i))^{κ_i−1} + sinh(κ_i) cos(n_i(θ − µ_i))^{κ_i}.

The thesis follows from the equality

cos(n_i(θ − µ_i))^{κ_i} = 2^{-κ_i} Σ_{k=0}^{κ_i} C(κ_i, k) cos((2k − κ_i) n_i (θ − µ_i)).

Proof of Prop. 5. The domain of integration can be parametrized by d_D((r, θ), (r_c, θ_c)) ≤ α, leading to

θ ∈ [ θ_c − arccos( (r² + r_c² − α²) / (2 r r_c) ), θ_c + arccos( (r² + r_c² − α²) / (2 r r_c) ) ].

Three cases must be discussed: (1) 0 ≤ r_c − α ≤ r_c + α ≤ 1, (2) r_c − α < 0, (3) r_c + α > 1. In scenario (1), the ball B_α(r_c, θ_c) is contained in D and its probability is

P[B_α(r_c, θ_c)] = (4/B) Σ_{i: n_i ≠ 0} (c_i / n_i) cos(n_i (θ_c − µ_i)) ∫_{r_c−α}^{r_c+α} r sin( n_i arccos( (r² + r_c² − α²) / (2 r r_c) ) ) dr + (4/B) ( Σ_{i: n_i = 0} c_i + A ) ∫_{r_c−α}^{r_c+α} r arccos( (r² + r_c² − α²) / (2 r r_c) ) dr,   (9)

where the angular integration comes from Prop. 3.
For simplicity, define

f_{n_i−1}(r) = r √(1 − ((r² + r_c² − α²)/(2 r r_c))²) U_{n_i−1}((r² + r_c² − α²)/(2 r r_c)),   g(r) = r arccos((r² + r_c² − α²)/(2 r r_c)),

where U_k is the k-th Chebyshev polynomial of the second kind. It is worth noting that f_{n_i−1}(r_c + α) = 0, f_{n_i−1}(r_c − α) = 0, f_{n_i−1}(α − r_c) = 0, and

f_{n_i−1}(r_c) = α √(1 − (α/(2 r_c))²) U_{n_i−1}(1 − α²/(2 r_c²)),   f_{n_i−1}(α) = α √(1 − (r_c/(2α))²) U_{n_i−1}(r_c/(2α)),

while g(r_c + α) = 0, g(r_c − α) = 0, g(α − r_c) = (α − r_c) π, and

g(r_c) = r_c arccos(1 − α²/(2 r_c²)),   g(α) = α arccos(r_c/(2α)).

The integrals in (9) can be approximated by the semi-area of an ellipse having α and f_{n_i−1}(r_c) (respectively, g(r_c)) as semi-axes,

∫_{r_c−α}^{r_c+α} f_{n_i−1}(r) dr ≈ (π/2) α f_{n_i−1}(r_c),   ∫_{r_c−α}^{r_c+α} g(r) dr ≈ (π/2) α g(r_c),

which can be seen as a modified version of Simpson's rule, since the latter would lead to a coefficient of 4/3 instead of π/2. A comparison between the two methods is shown in Appendix D. In scenario (2), the domain of integration contains the origin, and the argument of the arccos above may not be well defined. The singularity can be removed by decomposing the domain of integration as the union of a disk of radius α − r_c around the origin and the remaining part. The three scenarios can be summarized in one big formula.
For simplicity, define the operator I[·] that, given a function f, returns the ellipse approximation of its integral over balls. The three scenarios can then be written in a single formula, in which the integration endpoints are expressed through min{α − r_c, r_c − α} and max{r_c + α − 1, 0}, and the portion of the ball lying outside D is accounted for by the circular-segment correction α² arccos((1 − r_c)/α) − (1 − r_c) √(α² − (1 − r_c)²). From U_{n_i−1}(1) = n_i, the following approximation can be derived:

I[f_{n_i−1}](r) ≈ (π/2) n_i α² √(1 − α²/(4 r²)) ∼ (π/2) n_i α² (1 − α²/(8 r²)),

hence the integral boils down to the expressions in Prop. 5. To compute the average degree of a spatial network on the unit disk, the quantity ∫∫ P[B_α(r, θ)] sb(θ; c, n, µ) 2r dθ dr must be evaluated; in order to remove the singularity of the argument of the arccos, the domain of integration can be decomposed as a ball containing the origin and the remaining part.

Continuation of the proof of Prop. 8. Let d_i = D_{i,i} be the degree of the i-th node. Using the symmetry of A, the numerator of the Rayleigh quotient can be rewritten as

Σ_{i,j} v_i² √(d_i/d_j) A_{i,j} − Σ_{i,j} v_i A_{i,j} v_j = (1/2) Σ_{i,j} v_i² √(d_i/d_j) A_{i,j} + (1/2) Σ_{i,j} v_j² √(d_j/d_i) A_{i,j} − Σ_{i,j} v_i A_{i,j} v_j = (1/2) Σ_{i,j} ( A_{i,j} / √(d_i d_j) ) ( √(d_i) v_i − √(d_j) v_j )².

From the last equality it follows that the eigenvalues are all nonnegative. From (a − b)² ≤ 2(a² + b²) it follows that the numerator is

≤ Σ_{i,j} ( A_{i,j} / √(d_i d_j) ) ( d_i v_i² + d_j v_j² ) = 2 Σ_{i,j} ( A_{i,j} / √(d_i d_j) ) d_i v_i² ≤ 2 √N Σ_i d_i v_i² = 2 √N v^T D v,

where the last inequality uses Σ_j A_{i,j} / √(d_i d_j) ≤ Σ_j A_{i,j} / √(d_i) = √(d_i) ≤ √N, from which the thesis follows.
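The ellipse rule used in the proof of Prop. 5 can be checked numerically for the function g: in scenario (1) the exact value of ∫ g(r) dr is π α² / 2 (half the area of the ball, by the polar-coordinate area argument), and the ellipse approximation (π/2) α g(r_c) should be close to it for α small relative to r_c. A sketch with our own helper name:

```python
import numpy as np

def g(r, rc, alpha):
    """g(r) = r arccos((r^2 + rc^2 - alpha^2) / (2 r rc)) from the proof
    of Prop. 5 (scenario (1): ball strictly inside the unit disk)."""
    x = np.clip((r**2 + rc**2 - alpha**2) / (2 * r * rc), -1.0, 1.0)
    return r * np.arccos(x)

rc, alpha = 0.5, 0.1
r = np.linspace(rc - alpha, rc + alpha, 100001)
integral = g(r, rc, alpha).mean() * 2 * alpha        # numeric integral
ellipse = (np.pi / 2) * alpha * g(rc, rc, alpha)     # ellipse rule
```

For r_c = 0.5 and α = 0.1 the ellipse rule is accurate to a fraction of a percent, consistent with the comparison reported in Appendix D.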



A metric-probability space is a triple (S, d, µ), where S is a set of points, and µ is the Borel measure corresponding to the metric d. A function g : S → R is an element of L^∞(S) iff there exists M < ∞ such that µ({x ∈ S : |g(x)| > M}) = 0. The norm in L^∞(S) is the essential supremum, i.e., inf{M ≥ 0 : |g(x)| ≤ M for almost every x ∈ S}. Formally, ν is absolutely continuous with respect to µ, with Radon–Nikodym derivative ρ.




Figure 1: Example of the learned probability density function in link prediction, where the underlying metric space is (a) the unit circle and (b) the unit disk. (Left) Ground-truth sampling density vs. learned sampling density at the nodes. (Right) Degree vs. learned sampling density.

Figure 2: Test AUC for the link prediction task as a function of the dimension of the latent space. Performance averaged across 10 runs for each value of the latent dimension.

Figure 3: Test accuracy on the node classification task. Comparison between the best-scoring GSOs when the density is ignored (I) or learned (L). Results averaged across 10 runs; each point represents the performance in one run.

Figure 5: (Left) Distribution of chemical elements per class (active, inactive respectively in blue, red), computed as the number of compounds labeled as active (inactive) containing that particular element divided by the number of active (inactive) compounds. This is a measure of rarity. For example, potassium is present in 5 out of 400 active compounds, and in 1 out of 1,600 inactive compounds; hence, it is rarer to find potassium in an inactive compound. (Right) The mean importance of each element when ρ^{-1} is used to correct the GSO (L, orange) and when it is used for weighted pooling (P, green). Carbon, oxygen, and nitrogen have low mean importance, which makes sense as they are present in almost every compound, as shown in the left plot. The chemical elements are sorted according to their mean importance when ρ^{-1} is used to correct the GSO (orange bars).

Figure 6: Comparison between degree, density learnt to correct the GSO ρ L , and density learnt to perform weighted pooling ρ P , AIDS dataset.

Figure 7: Test AP for the link prediction task as a function of the dimension of the latent space. Performance averaged across 10 runs for each value of the latent dimension.

Figure 8: (Top left) Embedding of Pubmed in 2 dimensions using a distance+hubs decoder. The intensity of the color of each node i is proportional to the probability p_i = Υ(G)_i of being a hub. The three colours (red, green, and blue) correspond to the three different classes to which a node can belong, as reported in Tab. 1. (Bottom left) Histogram of the probabilities p = Υ(G) of being a hub, divided per class. (Right) Learned values of the radius parameters α (top) and β (bottom) of the geometric graphs with hubs auto-encoder on Pubmed, as a function of the latent dimension. Results averaged across 10 runs for each value of the latent dimension. The average probability of being a hub is 19.06%, and the percentage of nodes with a probability of being a hub greater than 0.99 is 10.10%.





Figure 10: Approximation of

Proof of Prop. 6. Similarly to what has been done in Prop. 5, the domain of integration can be parametrized as θ ∈ (θ_c − θ_r, θ_c + θ_r), where θ_r = arccos(d_r) and d_r = (cosh(r) cosh(r_c) − cosh(α)) / (sinh(r_c) sinh(r)).

P[B_α(r_c, θ_c)] is then obtained by integrating sinh(r) sb(θ; c, n, µ) / (cosh(R) − 1) over the ball, with the radial domain split into the intervals (l_1, u_1) and (l_2, u_2), where l_1 = 0, u_1 = max{α − r_c, 0}, l_2 = |α − r_c|, and u_2 = min{r_c + α, R}. The oscillating terms contribute integrals of the form

(cosh(R) − 1)^{-1} cos(n_i (θ_c − µ_i)) ∫_{l_2}^{u_2} sinh(r) √(1 − d_r²) U_{n_i−1}(d_r) dr.

The approximations θ_r ≈ √(2 − 2 d_r) and d_r ≈ 1 + 2 (e^{−2r} + e^{−2 r_c} − e^{α−r_c−r} − e^{−α−r_c−r}), as in Gugelmann et al. (2012), can be used to analyze the behavior of both integrals; for large R, the dominant contribution is of order e^{(α−R−r_c)/2}, where the approximation is justified by √(1 + x) = 1 + O(x) when |x| ≤ 1. Noting that −1 ≤ d_r ≤ 1, one can get rid of the polynomial contribution. Therefore, the probability of the ball is approximately proportional to sb(θ_c; c, n, µ), up to the radial factor above.

Table 3: Usual graph shift operators as metric-probability Laplacians.


The proof of the statement can be found in Appendix D. An important consequence of Prop. 7 is that the graph Laplacian obtained with m^{(i)}(x) = x^{-1/2} for every i ∈ {1, …, 4} is in general different from the symmetric normalized Laplacian. In light of Prop. 7, the two Laplacians are equivalent if every node is connected to nodes with the same degree, e.g., if the graph is k-regular. The difference between the two Laplacians can be better seen by studying their spectra. The next proposition introduces an upper bound on the eigenvalues of the Laplacian in (8).

Proposition 8. Let G = (V, E) be an undirected graph with adjacency matrix A ∈ R^{N×N} and degree matrix D = diag(A1), and let λ be an eigenvalue of the graph Laplacian in (8); then |λ| ≤ 2√N.

The proof of the proposition can be found in Appendix D. It is well known that the spectral radius of the symmetric normalized Laplacian is less than or equal to 2 (Chung, 1997), with equality holding for bipartite graphs. However, this is not the case for the Laplacian in (8), as shown in Example 1.

Proof of Prop. 7. Equality (2) is trivial, since diagonal matrices commute; equality (1) follows from the definitions. In order to prove (3), we note that V can be decomposed as V = Σ_{i=1}^n v_i e^{(i)} (e^{(i)})^T; therefore, the identity must hold for all values of k. Consider the indices k_1, k_2, …, k_n corresponding to the sorted values of v; then A_{k_1,i} = 0 for each i such that v_i > v_{k_1}. Take the index k_2 and consider the corresponding identity: the second addend is 0, because v_{k_2} can be either equal to v_{k_1}, in which case the difference is null, or v_{k_2} > v_{k_1}, in which case, from the previous step, A_{k_2,k_1} = 0. Therefore A_{k_2,i} = 0 for each i such that v_i > v_{k_2}. By finite induction, the thesis holds when A has null entries in position (i, j) whenever v_i ≠ v_j.

Proof of Prop. 8. The eigenvalues can be characterized via the Rayleigh quotient

⟨u, ( diag(D^{-1/2} A D^{-1/2} 1) − D^{-1/2} A D^{-1/2} ) u⟩ / ⟨u, u⟩.

Using Prop. 7, and considering u = D^{1/2} v, the previous formula can be rewritten as a quotient of quadratic forms in v, whose numerator is bounded in the computation involving d_i = D_{i,i} given earlier in this appendix.

