LATENT GRAPH INFERENCE USING PRODUCT MANIFOLDS

Abstract

Graph Neural Networks usually rely on the assumption that the graph topology is available to the network, as well as optimal for the downstream task. Latent graph inference allows models to dynamically learn the intrinsic graph structure of problems where the connectivity patterns of data may not be directly accessible. In this work, we generalize the discrete Differentiable Graph Module (dDGM) for latent graph learning. The original dDGM architecture used the Euclidean plane to encode latent features, based on which the latent graphs were generated. By incorporating Riemannian geometry into the model and generating more complex embedding spaces, we can improve the performance of the latent graph inference system. In particular, we propose a computationally tractable approach to produce product manifolds of constant curvature model spaces that can encode latent features of varying structure. The latent representations mapped onto the inferred product manifold are used to compute richer similarity measures that are leveraged by the latent graph learning model to obtain optimized latent graphs. Moreover, the curvature of the product manifold is learned during training alongside the rest of the network parameters and based on the downstream task, rather than being a static embedding space. Our novel approach is tested on a wide range of datasets, and outperforms the original dDGM model.

1. INTRODUCTION

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in a number of applications, from travel-time prediction (Derrow-Pinion et al. (2021)) to antibiotic discovery (Stokes et al. (2020)). They leverage the connectivity structure of graph data, which improves their performance in many applications as compared to traditional neural networks (Bronstein et al. (2017)). Most current GNN architectures assume that the topology of the graph is given and fixed during training. Hence, they update the input node features, and sometimes edge features, but preserve the input graph topology. A substantial amount of research has focused on improving diffusion using different types of GNN layers. However, discovering an optimal graph topology that can help diffusion has only recently gained attention (Topping et al. (2021); Cosmo et al. (2020); Kazi et al. (2022)). In many real-world applications, data can have some underlying but unknown graph structure, which we call a latent graph. That is, we may only be able to access a pointcloud of data. Nevertheless, this does not necessarily mean the data is not intrinsically related, or that its connectivity cannot be leveraged to make more accurate predictions. The vast majority of Geometric Deep Learning research so far has relied on human annotators or simplistic pre-processing algorithms to generate the graph structure passed to GNNs. Furthermore, in practice, even in settings where the correct graph is provided, it may be suboptimal for the task at hand, and the GNN may benefit from rewiring (Topping et al. (2021)). In this work, we drop the assumption that the graph adjacency matrix is given and study how to learn the latent graph in a fully-differentiable manner, using product manifolds, alongside the GNN diffusion layers. More specifically, we incorporate Riemannian geometry into the discrete Differentiable Graph Module (dDGM) proposed by Kazi et al. (2022).
We show that it is possible and beneficial to encode latent features into more complex embedding spaces beyond the Euclidean plane used in the original work. In particular, we leverage the convenient mathematical properties of product manifolds to learn the curvature of the embedding space in a fully-differentiable manner. Contributions: 1) We explain how to use model spaces of constant curvature for the embedding space. To do so, we outline a principled procedure to map Euclidean GNN output features to constant curvature model space manifolds with non-zero curvature: we use the hypersphere for spherical space and the hyperboloid for hyperbolic space. We also outline how to calculate distances between points in these spaces, which are then used by the dDGM sparse graph generating procedure to infer the edges of the latent graph. Unlike the original dDGM model, which explored using the Poincaré ball with fixed curvature for modeling hyperbolic space, in this work we use hyperboloids of arbitrary negative curvature. 2) We show how to construct more complex embedding spaces that can encode latent data of varying structure using product manifolds of model spaces. The curvature of each model space composing the product manifold is learned in a fully-differentiable manner alongside the rest of the model parameters, and based on the downstream task performance. 3) We test our approach on 15 datasets, which include standard homophilic graph datasets, heterophilic graphs, large-scale graphs, molecular datasets, and datasets for other real-world applications such as brain imaging and aerospace engineering. 4) It has been shown that traditional GNN models, such as Graph Convolutional Networks (GCNs) (Kipf & Welling (2017)) and Graph Attention Networks (GATs) (Veličković et al. (2018)), struggle to achieve good performance on heterophilic datasets (Zhu et al. (2020)), since homophily is in fact used as an inductive bias by these models.
Amongst other models, Sheaf Neural Networks (SNNs) (Hansen & Gebhart (2020) ; Bodnar et al. (2022) ; Barbero et al. (2022b; a) ) have been proposed to tackle this issue. We show that latent graph inference enables traditional GNN models to give good performance on heterophilic datasets without having to resort to sophisticated diffusion layers or model architectures such as SNNs. 5) To make this work accessible to the wider machine learning community, we have created a new PyTorch Geometric layer.

2. BACKGROUND

In this section we discuss relevant background for this work. We first provide a literature review regarding recent advances in latent graph inference using GNNs as well as related work on manifold learning and graph embedding. Next, we give an overview of the original Differentiable Graph Module (DGM) formulation, but we recommend referring to Kazi et al. (2022) for further details.

2.1. RELATED WORK

Latent graph and topology inference is a long-standing problem in Geometric Deep Learning. In contrast to algorithms that operate on sets and apply a shared pointwise function, such as PointNet (Qi et al. (2017)), in latent graph inference we want to learn to optimally share information between nodes in the pointcloud. Some contributions in the literature have focused on applying pre-processing steps to enhance diffusion based on an initial input graph (Topping et al. (2021); Gasteiger et al. (2019); Alon & Yahav (2021); Wu et al. (2019)). Note, however, that this area of research focuses on improving an already existing graph which may be suboptimal for the downstream task. This paper is more directly related to work that addresses how to learn the graph topology dynamically, instead of assuming a fixed graph at the start of training. When the underlying connectivity structure is unknown, architectures such as transformers (Vaswani et al. (2017)) and attentional multi-agent predictive models (Hoshen (2017)) simply assume the graph to be fully-connected, but this can become hard to scale to large graphs. Generating sparse graphs can result in more computationally tractable solutions (Fetaya et al. (2018)) and avoid over-smoothing (Chen et al. (2020a)). To this end, a series of models have been proposed, starting from Dynamic Graph Convolutional Neural Networks (DGCNNs) (Wang et al. (2019)), to other solutions that decouple graph inference and information diffusion, such as the Differentiable Graph Modules (DGMs) in Cosmo et al. (2020) and Kazi et al. (2022). Note that latent graph inference may also be referred to as graph structure learning in the literature. A survey of similar methods can be found in Zhu et al. (2021), and some additional classical methods include LDS-GNN (Franceschi et al., 2019), IDGL (Chen et al., 2020b), and Pro-GNN (Jin et al., 2020). In this work, we extend the dDGM module proposed by Kazi et al.
(2022) for learning latent graphs using product manifolds. Product spaces have primarily been studied in the manifold learning and graph embedding literature (Cayton (2005); Fefferman et al. (2013); Bengio et al. (2012)). Recent work has started exploring encoding the geometry of data into rich ambient manifolds. In particular, hyperbolic geometry has proven successful in a number of tasks (Liu et al. (2019); Chamberlain et al. (2017); Sala et al. (2018)). Different manifold classes have been employed to enhance modeling flexibility, such as products of constant curvature spaces (Gu et al. (2019)), matrix manifolds (Cruceru et al. (2021)), and heterogeneous manifolds (Di Giovanni et al. (2022)). We will leverage these ideas and use product manifolds to generate the embedding space for constructing our latent graphs.

2.2. AN OVERVIEW OF THE DISCRETE DIFFERENTIABLE GRAPH MODULE

Kazi et al. (2022) proposed a general technique for learning an optimized latent graph, based on the output features of each layer, onto which to apply the downstream GNN diffusion layers. Here, we specifically focus on the dDGM module (not the cDGM), which is much more computationally efficient and recommended by the authors. The main idea is to use some measure of similarity between the latent node features to generate latent graphs which are optimal for each layer $l$. We can summarize the architecture as
$$\hat{X}^{(l+1)} = f_\Theta^{(l)}\left(\mathrm{concat}(X^{(l)}, \hat{X}^{(l)}), A^{(l)}\right) \;\rightarrow\; A^{(l+1)} = P^{(l)}\left(\hat{X}^{(l+1)}\right) \;\rightarrow\; X^{(l+1)} = g_\phi\left(X^{(l)}, A^{(l+1)}\right).$$
The node features in layer $l$, $X^{(l)}$, are transformed into $\hat{X}^{(l+1)}$ through a function $f_\Theta^{(l)}$, which has learnable parameters, and compared using a similarity measure $\phi(T)$, which is parameterized by a scalar learnable parameter $T$. On the other hand, $g_\phi$ is a diffusion function which in practice corresponds to multiple GNN layers stacked together. $g_\phi$ diffuses information based on the inferred latent graph connectivity structure summarized in $A^{(l+1)}$, which is an unweighted sparse adjacency matrix.

The dDGM module generates a sparse $k$-degree graph using the Gumbel Top-k trick (Kool et al. (2019)), a stochastic relaxation of the kNN rule, to sample edges from the probability matrix $P^{(l)}(X^{(l)}; \Theta^{(l)}, T)$, where each entry corresponds to
$$p_{ij}^{(l)}(\Theta^{(l)}) = \exp\left(\phi\left(f_\Theta^{(l)}(x_i^{(l)}), f_\Theta^{(l)}(x_j^{(l)}); T\right)\right) = \exp\left(\phi\left(\hat{x}_i^{(l+1)}, \hat{x}_j^{(l+1)}; T\right)\right).$$
The main similarity measure used in Kazi et al. (2022) was the distance between the features of two nodes in the graph embedding space. They assumed that the latent features lay in a Euclidean plane of constant curvature $K_E = 0$, so that
$$p_{ij}^{(l)} = \exp\left(-T\, d_E\left(f_\Theta^{(l)}(x_i^{(l)}), f_\Theta^{(l)}(x_j^{(l)})\right)\right) = \exp\left(-T\, d_E\left(\hat{x}_i^{(l+1)}, \hat{x}_j^{(l+1)}\right)\right), \qquad (1)$$
where $d_E$ denotes distance in Euclidean space. Then, based on $\mathrm{argsort}\left(\log(p_i^{(l)}) - \log(-\log(q))\right)$, where $q \in \mathbb{R}^N$ is uniform i.i.d. in the interval $[0, 1]$, we can sample the edges
$$\mathcal{E}^{(l)}(X^{(l)}; \Theta^{(l)}, T, k) = \{(i, j_{i,1}), (i, j_{i,2}), \dots, (i, j_{i,k}) : i = 1, \dots, N\},$$
where $k$ is the number of sampled connections per node using the Gumbel Top-k trick. This sampling approach follows the categorical distribution $p_{ij}^{(l)} / \sum_r p_{ir}^{(l)}$, and $\mathcal{E}^{(l)}(X^{(l)}; \Theta^{(l)}, T, k)$ is represented by the unweighted adjacency matrix $A^{(l)}(X^{(l)}; \Theta^{(l)}, T, k)$. Note that including noise in the edge sampling approach will result in the generation of some random edges in the latent graphs, which can be understood as a form of regularization. In this work, we generalize Equation 1 to measure similarities based on distances while dropping the assumption used in Kazi et al. (2022), which limits itself to fixed-curvature spaces, specifically to Euclidean space where $K_E = 0$. We will use product manifolds of model spaces of constant curvature to improve the similarity measure $\phi$ and construct better latent graphs.
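To make the sampling step concrete, the Gumbel Top-k edge sampling described above can be sketched as follows (a minimal NumPy sketch under simplifying assumptions, not the authors' implementation; the function name and the dense matrix of log-probabilities are illustrative):

```python
import numpy as np

def gumbel_top_k_edges(log_p, k, rng):
    """Sample k outgoing edges per node with the Gumbel Top-k trick.

    log_p: (N, N) array of unnormalized log edge probabilities log p_ij.
    k:     number of sampled connections per node.
    rng:   a numpy Generator supplying the uniform noise q.
    """
    N = log_p.shape[0]
    q = rng.uniform(low=1e-12, high=1.0, size=(N, N))  # q ~ U(0, 1), avoid log(0)
    perturbed = log_p - np.log(-np.log(q))             # add Gumbel noise
    np.fill_diagonal(perturbed, -np.inf)               # exclude self-loops
    top_k = np.argsort(-perturbed, axis=1)[:, :k]      # k largest entries per row
    return [(i, j) for i in range(N) for j in top_k[i]]
```

Because the noise is resampled at every forward pass, a few random edges appear alongside the most probable ones, which is precisely the regularization effect mentioned above.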
3. LATENT GRAPH INFERENCE USING PRODUCT MANIFOLDS

In this section, we first introduce model spaces, which are a special type of Riemannian manifold, and explain how to map Euclidean GNN output features to model spaces with non-zero curvature. In case the reader is unfamiliar with the topic, additional details regarding Riemannian manifolds can be found in Appendix A. Then, we mathematically define product manifolds and how to calculate distances between points in the manifold. Next, we introduce scaling metrics, which help us learn the curvature of each model space composing the product manifold. A discussion on product manifold curvature learning can be found in Appendix B.

The intuition behind the method is that we can consider the embedding space represented by the product manifold as a combination of simpler spaces (model spaces of constant curvature), and compute distances between the latent representations mapped onto the product manifold by considering distances in each model space individually and later aggregating them in a principled manner. This allows us to generate diverse embedding spaces which at the same time are computationally tractable.

3.1. CONSTANT CURVATURE MODEL SPACES

Curvature is effectively a measure of geodesic dispersion: when there is no curvature geodesics stay parallel, with negative curvature they diverge, and with positive curvature they converge. Euclidean space, $\mathbb{E}^{d_E}_{K_E} = \mathbb{R}^{d_E}$, is a flat space with curvature $K_E = 0$. Note that here we use $d_E$ to denote dimensionality. Hyperbolic and spherical space, on the other hand, have negative and positive curvature, respectively. We define hyperboloids as $\mathbb{H}^{d_H}_{K_H} = \{x_p \in \mathbb{R}^{d_H+1} : \langle x_p, x_p \rangle_L = 1/K_H\}$, where $K_H < 0$ and $\langle \cdot, \cdot \rangle_L$ is the Lorentz inner product $\langle x, y \rangle_L = -x_1 y_1 + \sum_{j=2}^{d_H+1} x_j y_j, \; \forall x, y \in \mathbb{R}^{d_H+1}$; and hyperspheres as $\mathbb{S}^{d_S}_{K_S} = \{x_p \in \mathbb{R}^{d_S+1} : \langle x_p, x_p \rangle_2 = 1/K_S\}$, where $K_S > 0$ and $\langle \cdot, \cdot \rangle_2$ is the standard Euclidean inner product $\langle x, y \rangle_2 = \sum_{j=1}^{d_S+1} x_j y_j, \; \forall x, y \in \mathbb{R}^{d_S+1}$. Table 1 provides a summary of relevant operators in Euclidean, hyperbolic, and spherical spaces with arbitrary curvatures. The closed forms for the distances between points in hyperbolic and spherical space use arccosh and arccos, whose domains are $\{x \in \mathbb{R} : x \geq 1\}$ and $\{x \in \mathbb{R} : -1 \leq x \leq 1\}$, respectively. We apply clipping to avoid inputs close to the domain limits and prevent instabilities during training. The latent output features produced by the neural network layers are in Euclidean space and must be mapped to the relevant model spaces before applying the distance metrics; we use the appropriate exponential map (refer to Table 1).
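The mappings and distances just described can be sketched numerically for the unit-magnitude curvatures $K_H = -1$ and $K_S = 1$ (a minimal NumPy sketch; the function names are illustrative, the exponential maps are taken at the origin of each model space, and clipping is applied at the arccosh/arccos domain limits as discussed above):

```python
import numpy as np

def exp_map_hyperboloid(v):
    """Map a Euclidean feature v in R^d onto the hyperboloid H^d (K = -1)
    via the exponential map at the origin o = (1, 0, ..., 0)."""
    n = np.linalg.norm(v)
    if n < 1e-9:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cosh(n)], np.sinh(n) * v / n))

def exp_map_sphere(v):
    """Map a Euclidean feature v in R^d onto the unit hypersphere S^d (K = 1)
    via the exponential map at the origin o = (1, 0, ..., 0)."""
    n = np.linalg.norm(v)
    if n < 1e-9:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cos(n)], np.sin(n) * v / n))

def dist_hyperboloid(x, y):
    """Geodesic distance on H^d (K = -1); the arccosh input is clipped to
    its domain [1, inf) for numerical stability."""
    minus_lorentz = x[0] * y[0] - np.dot(x[1:], y[1:])  # equals -<x, y>_L
    return np.arccosh(np.clip(minus_lorentz, 1.0, None))

def dist_sphere(x, y):
    """Geodesic distance on S^d (K = 1); the arccos input is clipped to [-1, 1]."""
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
```

A useful sanity check is that the geodesic distance from the origin to $\exp_o(v)$ recovers the Euclidean norm of $v$ in both spaces.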

3.2. PRODUCT MANIFOLDS

We define a product manifold as the Cartesian product $\mathcal{P} = \bigtimes_{i=1}^{n_\mathcal{P}} \mathcal{M}^{d_i}_{K_i}$, where $K_i$ and $d_i$ are the curvature and dimensionality of the manifold $\mathcal{M}^{d_i}_{K_i}$, respectively. We write points $x_p \in \mathcal{P}$ using their coordinates $x_p = \mathrm{concat}\left(x_p^{(1)}, x_p^{(2)}, \dots, x_p^{(n_\mathcal{P})}\right)$, with $x_p^{(i)} \in \mathcal{M}^{d_i}_{K_i}$. Also, the metric of the product manifold decomposes into the sum of the constituent metrics, $g_\mathcal{P} = \sum_{i=1}^{n_\mathcal{P}} g_i$; hence, $(\mathcal{P}, g_\mathcal{P})$ is also a Riemannian manifold provided that all $(\mathcal{M}^{d_i}_{K_i}, g_i)$ are Riemannian manifolds in the first place. Note that the signature of the product space, that is, its parametrization, has several degrees of freedom: the number of components used, as well as the type of model spaces, their dimensionality, and curvature. If we restrict $\mathcal{P}$ to be composed of the Euclidean plane $\mathbb{E}^{d_E}_{K_E}$, hyperboloids $\mathbb{H}^{d_{H_j}}_{K_{H_j}}$, and hyperspheres $\mathbb{S}^{d_{S_k}}_{K_{S_k}}$ of constant curvature, we can write an arbitrary product manifold of model spaces as
$$\mathcal{P} = \mathbb{E}^{d_E}_{K_E} \times \left( \bigtimes_{j=1}^{n_H} \mathbb{H}^{d_{H_j}}_{K_{H_j}} \right) \times \left( \bigtimes_{k=1}^{n_S} \mathbb{S}^{d_{S_k}}_{K_{S_k}} \right) = \mathbb{E} \times \left( \bigtimes_{j=1}^{n_H} \mathbb{H}_j \right) \times \left( \bigtimes_{k=1}^{n_S} \mathbb{S}_k \right), \qquad (2)$$
where $K_E = 0$, $K_{H_j} < 0$, and $K_{S_k} > 0$. The rightmost part of Equation 2 is included to simplify the notation. $\mathcal{P}$ has a total of $1 + n_H + n_S$ component spaces, and total dimension $d_E + \sum_{j=1}^{n_H} d_{H_j} + \sum_{k=1}^{n_S} d_{S_k}$. As shown in Gallier & Quaintance (2020), for a product manifold as defined in Equation 2, the geodesics, exponential map, and logarithmic map on $\mathcal{P}$ are the concatenation of the corresponding notions in the individual model spaces.
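As a concrete (purely illustrative) example of the bookkeeping involved, consider a hypothetical signature with $d_E = 2$, one hyperboloid with $d_{H_1} = 2$, and one hypersphere with $d_{S_1} = 2$:

```latex
\mathcal{P} = \mathbb{E}^{2}_{0} \times \mathbb{H}^{2}_{K_{H_1}} \times \mathbb{S}^{2}_{K_{S_1}},
\qquad
\dim \mathcal{P} = d_E + d_{H_1} + d_{S_1} = 2 + 2 + 2 = 6 .
```

Since the hyperboloid and the hypersphere are each embedded in an ambient space with one extra dimension, a point $x_p \in \mathcal{P}$ is stored with $2 + 3 + 3 = 8$ coordinates, which is why the GNN output features must later be split into subarrays of the appropriate sizes before being mapped to each component space.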

3.3. DISTANCES AND SCALING METRICS FOR PRODUCT MANIFOLDS

To compute distances between points in the product manifold, we can add up the squared distances of the coordinates in each of the individual manifolds:
$$d_\mathcal{P}(x_{p_1}, x_{p_2})^2 = d_E\left(\overline{x}^{(1)}_{p_1}, \overline{x}^{(1)}_{p_2}\right)^2 + \sum_{j=1}^{n_H} d_{\mathbb{H}_j}\left(\overline{x}^{(1+j)}_{p_1}, \overline{x}^{(1+j)}_{p_2}\right)^2 + \sum_{k=1}^{n_S} d_{\mathbb{S}_k}\left(\overline{x}^{(1+n_H+k)}_{p_1}, \overline{x}^{(1+n_H+k)}_{p_2}\right)^2,$$
where the overline denotes that the adequate exponential map, projecting Euclidean feature entries to the relevant model space, has been applied before computing the distance. In practice, this is equivalent to mapping the feature outputs to the product manifold and operating on $\mathcal{P}$ directly. As suggested in Tabaghi et al. (2021), instead of directly updating the curvature of the hyperboloid and hypersphere model spaces used to construct the product manifold, we can set $K_{H_j} = -1, \forall j$, and $K_{S_k} = 1, \forall k$, and use a scaled distance metric instead. To do so, we introduce learnable coefficients $\alpha_{H_j}$ and $\alpha_{S_k}$:
$$d_\mathcal{P}(x_{p_1}, x_{p_2})^2 = d_E\left(\overline{x}^{(1)}_{p_1}, \overline{x}^{(1)}_{p_2}\right)^2 + \sum_{j=1}^{n_H} \left(\alpha_{H_j}\, d_{\mathbb{H}_j}\left(\overline{x}^{(1+j)}_{p_1}, \overline{x}^{(1+j)}_{p_2}\right)\right)^2 + \sum_{k=1}^{n_S} \left(\alpha_{S_k}\, d_{\mathbb{S}_k}\left(\overline{x}^{(1+n_H+k)}_{p_1}, \overline{x}^{(1+n_H+k)}_{p_2}\right)\right)^2,$$
which is equivalent to learning the curvature of the non-Euclidean model spaces, but computationally more tractable and efficient (for further details on how these coefficients are updated, refer to Appendix C.2). This newly defined distance metric can then be applied to calculate the probability of an edge connecting latent features,
$$p_{ij}^{(l)}(\Theta^{(l)}) = \exp\left(-T\, d_\mathcal{P}\left(\overline{f_\Theta^{(l)}(x_i^{(l)})}, \overline{f_\Theta^{(l)}(x_j^{(l)})}\right)\right). \qquad (3)$$
Hence, $\mathcal{E}^{(l)}(X^{(l)}; \Theta^{(l)}, T, k, d_\mathcal{P}) = \{(i, j_{i,1}), (i, j_{i,2}), \dots, (i, j_{i,k}) : i = 1, \dots, N\}$. As discussed in Kazi et al. (2022), the logarithms of the edge probabilities are used to update the dDGM.
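A minimal NumPy sketch of this scaled product distance, for an assumed signature with one Euclidean, one hyperbolic, and one spherical component (the function names, the fixed signature, and the handling of the feature subarrays are illustrative; in the model the coefficients $\alpha_H$ and $\alpha_S$ are learned):

```python
import numpy as np

def _exp_h(v):  # exponential map at the origin of H^d (K = -1)
    n = np.linalg.norm(v)
    if n < 1e-9:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cosh(n)], np.sinh(n) * v / n))

def _exp_s(v):  # exponential map at the origin of S^d (K = 1)
    n = np.linalg.norm(v)
    if n < 1e-9:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cos(n)], np.sin(n) * v / n))

def _d_h(x, y):  # hyperboloid distance, clipped to the arccosh domain
    return np.arccosh(np.clip(x[0] * y[0] - np.dot(x[1:], y[1:]), 1.0, None))

def _d_s(x, y):  # hypersphere distance, clipped to the arccos domain
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

def product_distance(u, v, d_e, d_h, d_s, alpha_h=1.0, alpha_s=1.0):
    """Scaled product distance d_P for the signature E^{d_e} x H^{d_h} x S^{d_s}.

    u, v are Euclidean GNN output features of dimension d_e + d_h + d_s; the
    non-Euclidean subarrays are mapped onto their model spaces (the overline
    in the text) before the component distances are aggregated."""
    ue, uh, us = u[:d_e], u[d_e:d_e + d_h], u[d_e + d_h:]
    ve, vh, vs = v[:d_e], v[d_e:d_e + d_h], v[d_e + d_h:]
    sq = np.linalg.norm(ue - ve) ** 2
    sq += (alpha_h * _d_h(_exp_h(uh), _exp_h(vh))) ** 2
    sq += (alpha_s * _d_s(_exp_s(us), _exp_s(vs))) ** 2
    return np.sqrt(sq)
```

Increasing $\alpha_H$ (or $\alpha_S$) stretches distances measured in the corresponding component, which is how rescaling stands in for changing the curvature of that model space.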
The dDGM update is done by incorporating an additional term into the network loss function, which depends on
$$\log p_{ij}^{(l)}(\Theta^{(l)}) = -T\, d_\mathcal{P}\left(\overline{f_\Theta^{(l)}(x_i^{(l)})}, \overline{f_\Theta^{(l)}(x_j^{(l)})}\right) = -T\, d_\mathcal{P}\left(x^{(l)}_{p_i}, x^{(l)}_{p_j}\right), \qquad (4)$$
where the additional graph loss is given by $\mathcal{L}_{GL} = \sum_{i=1}^{N} \delta(y_i, \hat{y}_i) \sum_{l=1}^{L} \sum_{j : (i,j) \in \mathcal{E}^{(l)}} \log p_{ij}^{(l)}$, and $\delta(y_i, \hat{y}_i) = \mathbb{E}(\mathrm{ac}_i) - \mathrm{ac}_i$ is a reward function based on the expected accuracy of the model. The loss function used to update the dDGM model, $\mathcal{L}_{GL}$, is identical to the original loss proposed by Kazi et al. (2022); for a brief review one may refer to Appendix C.1. Note that after passing the input $x_i^{(l)}$ through the dDGM parameterized function $f_\Theta^{(l)}$, the output $f_\Theta^{(l)}(x_i^{(l)}) = \hat{x}_i^{(l+1)} = x^{(l)}_{p_i}$ has dimension $d_E + \sum_{j=1}^{n_H} d_{H_j} + \sum_{k=1}^{n_S} d_{S_k}$ and must be subdivided into $1 + n_H + n_S$ subarrays, one for each of the component spaces. Each subarray must be appropriately mapped to its model space; hence the overline in $\overline{f_\Theta^{(l)}(x_i^{(l)})}$ in Equation 4. Finally, Figure 1 summarizes the method described in this section. Note that $x_p$ in Figure 1 corresponds to the concatenation of the origins of each model space composing the product manifold. The appropriate exponential map is used to map the points from the tangent plane to the manifold. We construct the latent graph based on the distances on the learned manifold.
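For a single dDGM layer, the resulting graph loss can be sketched as follows (a simplified single-layer sketch; the function name and the per-node reward array are assumptions — in the paper the reward $\delta(y_i, \hat{y}_i) = \mathbb{E}(\mathrm{ac}_i) - \mathrm{ac}_i$ compares expected and realized accuracy):

```python
import numpy as np

def graph_loss(log_p, edges, reward):
    """Sketch of the dDGM graph loss L_GL restricted to one layer.

    log_p:  (N, N) array of log edge probabilities, log p_ij = -T * d_P(., .).
    edges:  list of (i, j) edges sampled by the Gumbel Top-k step.
    reward: (N,) per-node rewards delta(y_i, y_hat_i)."""
    return sum(reward[i] * log_p[i, j] for i, j in edges)
```

Intuitively, the reward is positive when a node is classified worse than expected, so minimizing this term pushes down the probability of the edges that were sampled for poorly classified nodes, and vice versa.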

4. EXPERIMENTS AND RESULTS

The main objective of the experimental validation is to show that latent graph inference can benefit from using products of model spaces. To do so, we compare the performance of the dDGM module when using single model spaces against Cartesian products of model spaces. The model spaces are denoted as: Euclidean (dDGM-E/dDGM*-E, which is equivalent to the original architecture used by Kazi et al. (2022)), hyperbolic (dDGM-H/dDGM*-H), and spherical (dDGM-S/dDGM*-S). The asterisk in the model name denotes that the dDGM* module is tasked with generating the latent graph without having access to the original adjacency matrix: dDGM models take as input $X^{(0)}$ and $A^{(0)}$, whereas dDGM* models only have access to $X^{(0)}$. To refer to product manifolds, we simply append all the model spaces that compose the manifold to the name of the module; for example, the dDGM-SS module's embedding space is a torus. In practice, we use the same dimensionality but different curvature for each of the Cartesian components of the product manifolds. Lastly, if a GNN model uses the dDGM module, we name the network after both its diffusion layers and its latent graph inference module. For example, GCN-dDGM-E refers to a GCN that, instead of using the original dataset graph for diffusion, incorporates latent graph inference into the network and uses the Euclidean plane as embedding space. Note that we only use a single latent graph inference module per neural network, that is, networks diffuse information based on only one latent graph. This is in line with previous work (Kazi et al. (2022)). Additionally, in Appendix E.1, we investigate the effect of leveraging multiple latent graphs in the same network and conclude that in general it is better to use a single latent graph due to computational efficiency and diminishing returns. The study regarding computational efficiency can be found in Appendix C.3.
In particular, we compare the runtime speedup obtained using symbolic matrices as compared to standard dense PyTorch matrices. We observe that as more product manifolds and dDGM modules are included, the runtime speedup obtained using symbolic matrices becomes increasingly large. Moreover, without symbolic matrices, standard GPUs (we use NVIDIA P100 and Tesla T4) run out of memory for datasets with $O(10^4)$ nodes such as PubMed, Physics, and CS. Hence, we recommend using symbolic matrices to help with scalability. Model architecture descriptions for all experiments can be found in Appendix G.

4.1. HOMOPHILIC AND HETEROPHILIC BENCHMARK GRAPH DATASETS

We first focus on standard graph datasets widely discussed in the Geometric Deep Learning literature, such as Cora, CiteSeer (Yang et al. (2016); Lu & Getoor (2003); Sen et al. (2008)), PubMed, Physics, and CS (Shchur et al. (2018)), which have high homophily levels ranging between 0.74 and 0.93. We also present results for several heterophilic datasets, which have homophily levels between 0.11 and 0.23. In particular, we work with Texas, Wisconsin, Squirrel, and Chameleon (Rozemberczki et al., 2021). Results of particular interest for these datasets are recorded in Table 2 (benchmark models such as LDS-GNN (Franceschi et al., 2019), IDGL, IDGL-ANCH (Chen et al., 2020b), Pro-GNN, Pro-GNN-fs (Jin et al., 2020), and GCN-Jaccard (Wu et al., 2019) are also included). Additional experiments are available in Appendices E.2, E.3, and E.4, in which we perform an in-depth exploration of different hyperparameters, compare dDGMs and dDGM*s for all datasets, and try many product manifold combinations. Referring back to Table 2, we can see that product manifolds consistently outperform latent graph inference systems which only leverage a single model space for modeling the embedding space. Also note that, unlike in the work by Kazi et al. (2022), we do find single hyperbolic model spaces to often outperform inference systems that use the Euclidean plane as embedding space. This shows that mapping the Euclidean output features of the GNN layers to hyperbolic space using the exponential map before computing distances is indeed of paramount importance (Kazi et al. (2022) ignored the exponential maps required to map features to the Poincaré ball).

Table 2: Results for heterophilic and homophilic datasets combining GCN diffusion layers with the latent graph inference system. We display results using single model spaces as well as product manifolds to construct the latent graphs. The First, Second and Third best models for each dataset are highlighted in each table. k denotes the number of connections per node when implementing the Gumbel Top-k sampling algorithm. Additional k values are tested in Appendix E.2. Note that the models which use the Euclidean plane (former dDGM) as embedding space, denoted with an E in the table, are equivalent to those presented in Kazi et al. (2022).

We have shown that the latent graph inference system enables GCN diffusion layers to achieve good performance on heterophilic datasets. We hypothesize that for this to be possible, it should be able to generate homophilic latent graphs, on which GCNs can easily diffuse. In Table 3 we display the homophily levels of the learned latent graphs, which corroborates our intuition. As we can see from the results, all models are able to generate latent graphs with higher homophily than those of the original dataset graphs. The latent graph inference system seems to find it easier to increase the homophily levels of smaller datasets, which is reasonable since there is less information to reorganize. There is a clear correlation between model performance in terms of accuracy (Table 2) and the homophily level that the dDGM* modules are able to achieve for the latent graphs (Table 3).

Table 3: Homophily level of the learned latent graphs. Latent graph inference modules which use different manifolds to generate their respective latent graphs achieve different homophily levels. Also, depending on weight initialization, the inference system can converge to slightly different latent graphs.

Published as a conference paper at ICLR 2023

For example, in the case of Texas we achieve the highest homophily levels for the latent graphs, between 0.85 ± 0.02 and 0.91 ± 0.03, and also some of the highest accuracies, ranging from 73.88 ± 9.95% to 81.67 ± 7.05%. For Wisconsin the homophily level is lower than for Texas, but this can be attributed to the fact that k = 10, inevitably creating more connections with nodes from other classes.
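The homophily levels discussed here can be measured, for example, as edge homophily, the fraction of edges joining nodes that share a label (a minimal sketch; the function name is illustrative, and this is one common definition rather than necessarily the exact metric used):

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges (i, j) whose endpoint labels agree."""
    if not edges:
        return 0.0
    same = sum(1 for i, j in edges if labels[i] == labels[j])
    return same / len(edges)
```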
Also, in Wisconsin there are two classes with substantially fewer nodes than the rest, meaning that a high accuracy can be achieved even if those are misclassified. On the other hand, for Squirrel, although the latent graph inference system still manages to increase homophily from 0.22 in the original graph to between 0.27 ± 0.00 and 0.43 ± 0.03 in the latent graph, the increase is not as large as for the other datasets, and we can see how this also has an effect on performance: in Table 2 the maximum accuracy for Squirrel is 35.00 ± 2.35%. Note that this is still substantially better than using an MLP or a standard GCN, which obtain accuracies of 30.44 ± 2.55% and 24.19 ± 2.56%, respectively. The same discussion applies to Chameleon. Figure 2 displays how the graph connectivity is modified during the training process. This shows that the inference system is able to dynamically learn an optimal connectivity structure for the latent graph based on the downstream task, and modify it accordingly during training. Additional latent graph plots for the different datasets can be found in Appendix F.

4.2. REAL-WORLD APPLICATIONS

We next consider two real-world pointcloud datasets: a brain imaging dataset, and an aerothermodynamics dataset which challenges networks to classify different regions of a shock wave around a rocket (refer to Appendix D for more information on the dataset). Note that since neither of these datasets has a graph structure (they are pointclouds), only the dDGM* can be utilized in this case. Results are given in Table 4. We use GAT diffusion layers and compare the performance using single model spaces and product manifolds. Almost all models using the latent graph inference system outperform the MLP. It is important to note that since these datasets do not provide an input graph, it would not be possible to use GAT models without dDGM* modules. Again, we find that using product manifolds to model the latent space of potentially complex real-world datasets proves beneficial and boosts model accuracy.
In principle, the aerothermodynamics dataset classifies shock regions based on the flow's absolute velocity into 4 regions, as recorded in Table 4. However, we tested increasing the number of classes by further subdividing the flow into a total of 7 regions, as shown in Figure 3. Interestingly, we found that the latent graph inferred by the model does not only cluster nodes with similar labels together; it actually organizes the latent graph in order of the absolute velocities, which are not explicitly given to the model (the velocities are used to create the labels, but their values are not provided as input). This suggests that the graph generation system is organizing the latent graph based on some inferred high-level understanding of the physics of the problem. Additional latent graphs for these datasets are provided in Appendix F.2.

4.3. SCALING TO LARGE GRAPHS

All datasets considered so far are relatively small. In this section we work with datasets from the Open Graph Benchmark (OGB), which contain large-scale graphs and require models to perform realistic out-of-distribution generalization. In particular, we use the OGB-Arxiv and OGB-Products datasets. As discussed in Appendix C.3.3, for training on these datasets we use graph subsampling techniques and Graph Attention Network version 2 (GATv2) diffusion layers (Brody et al. (2021)); since we do not expect overfitting, we can use more expressive layers. OGB-Arxiv and OGB-Products have a total of 40 and 47 classes, respectively, whereas previous datasets only considered multi-class classification with between 3 and 15 classes. This, added to the fact that these datasets have orders of magnitude more nodes and edges, makes the problems in this section considerably more challenging. For OGB-Arxiv, using an MLP and a GATv2 model without latent graph inference we obtain accuracies of 63.49 ± 0.15% and 61.93 ± 1.62%, respectively, while the best model with latent graph inference, GATv2-dDGM*-EHS, achieves an accuracy of 65.06 ± 0.09%. For OGB-Products, the MLP and GATv2 results are 66.05 ± 0.20% and 62.02 ± 2.60%, and for the best model, GATv2-dDGM-E, we record an accuracy of 66.59 ± 0.30%. From these results (more in Appendix E.5), we conclude that latent graph inference is still beneficial for these larger datasets, but there is substantial room for improvement: graph subsampling interferes with embedding space learning.

5. DISCUSSION AND CONCLUSION

In this work we have incorporated Riemannian geometry into the dDGM latent graph inference module by Kazi et al. (2022). First, we have shown how to work with manifolds of constant arbitrary curvature, both positive and negative. Next, we have leveraged product manifolds of model spaces and their convenient mathematical properties to enable the dDGM module to generate a more complex homogeneous manifold with varying curvature which can better encode the latent data, while learning the curvature of each model space composing the product manifold during training. We have evaluated our method on many diverse datasets, and we have shown that using product manifolds to model the embedding space for the latent graph gives enhanced downstream performance as compared to using single model spaces of constant curvature. The inference system has been tested on both homophilic and heterophilic benchmarks. In particular, we have found that with optimized latent graphs, diffusion layers like GCNs are able to successfully operate on datasets with low homophily levels. Additionally, we have tested and proven the applicability of our method to large-scale graphs. Lastly, we have shown the benefits of applying this procedure to real-world problems such as brain imaging and aerospace engineering. All experiments discussed in the main text are concerned with transductive learning; however, the method is also applicable to inductive learning, see Appendix E.6. The product manifold embedding space approach has provided a computationally tractable way of generating more complex homogeneous manifolds for the latent features' embedding space. Furthermore, the curvature of the product components is learned rather than being a fixed hyperparameter, which allows for greater flexibility. However, the number of model spaces used to generate the product manifold must be specified before training.
It would be interesting to devise an approach for the network to independently add more model spaces to the product manifold when needed. Also, we restrict our approach to product manifolds based on model spaces of constant curvature due to their suitable mathematical properties. Such product manifolds do not cover all possible arbitrary manifolds in which the latent data could be encoded, and hence there could still be, mathematically speaking, more optimal manifolds to represent the data. It is worth exploring whether approaches to generate even more diverse yet computationally tractable manifolds are possible.

Future Work

Lastly, there are a few limitations intrinsic to the dDGM module, irrespective of the product manifold embedding approach introduced in this work. Firstly, although utilizing symbolic matrices can help computational efficiency (Appendix C.3), the method still has quadratic complexity. Kazi et al. (2022) proposed computing probabilities in a neighborhood of the node and using tree-based algorithms to reduce it to O(n log n). Moreover, the Gumbel Top-k sampling approach restricts the average node degree of the latent graph and requires manually adjusting the value of k through testing. A possible solution could be to use a distance-based sparse thresholding approach, in which an unweighted edge is created between two nodes if they are within a threshold distance of each other in latent space. This is similar to the Gumbel Top-k trick, but instead of choosing a fixed number of closest neighbors, we connect all nodes within a given distance, which could better capture the heterogeneity of the graph. However, when we tested this approach we found it quite unstable, and although it removes the parameter k, a threshold distance must still be chosen. Another avenue to help with scalability, improve computational complexity, and facilitate working with large-scale graphs would be to take a hierarchical perspective.
Inspired by brain interneurons (Freund & Buzsáki (1996)), we could introduce fictitious connector inducing nodes in different regions of the graph, use those nodes to summarize different regions of large graphs, and apply the kNN algorithm or the Gumbel Top-k trick only to the fictitious connector inducing nodes. This way the computational complexity would still be quadratic, but in the number of connector nodes rather than in the total number of nodes in the graph.
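To make the two edge-generation strategies discussed above concrete, the sketch below contrasts a fixed-k rule (the noise-free limit of the Gumbel Top-k trick) with a distance-threshold rule. The function names and the NumPy implementation are our own illustrative choices, not the paper's code.

```python
import numpy as np

def topk_edges(z, k):
    """Directed edges from each node to its k nearest latent neighbours
    (the deterministic limit of the Gumbel Top-k trick, noise omitted)."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # no self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]
    return {(i, int(j)) for i in range(len(z)) for j in nbrs[i]}

def threshold_edges(z, r):
    """Edges between all pairs closer than r in latent space; node degree
    now varies with local density instead of being fixed at k."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    i, j = np.where(d < r)
    return set(zip(i.tolist(), j.tolist()))
```

Note how `topk_edges` forces every node, including isolated outliers, to have exactly k out-edges, whereas `threshold_edges` leaves outliers disconnected, which is the instability trade-off described above.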

A RIEMANNIAN MANIFOLDS

In Riemannian geometry, we define a Riemannian manifold or Riemannian space (M, g) as a real, differentiable manifold M in which each tangent space has an associated inner product g, that is, a Riemannian metric, which must vary smoothly between points on the manifold. The Riemannian manifold M ⊆ R^N (living in the ambient space R^N) is a collection of real vectors and is locally similar to a linear space. The Riemannian metric generalizes inner products to Riemannian manifolds. It also allows us to define geometric notions on a Riemannian manifold such as lengths of curves, curvature, and angles, to name a few. If x_p ∈ M, then we denote the tangent space at x_p by T_{x_p}M, which has the same dimensionality as M. T_{x_p}M is the collection of all tangent vectors at x_p. Moreover, g_{x_p} : T_{x_p}M × T_{x_p}M → R is given by a positive-definite inner product on the tangent space and depends smoothly on x_p. A geodesic is the shortest smooth path between two points on a Riemannian manifold and generalizes the notion of a straight line in Euclidean space. The length of a continuously differentiable curve γ : t ↦ γ(t) ∈ M, t ∈ [0, 1], is given by L(γ) = ∫_0^1 ||γ'(t)|| dt. Note that L(γ) is unchanged by a monotone reparametrization. The geodesic distance between two points x_{p_i}, x_{p_j} ∈ M is defined as the infimum (greatest lower bound) of the length taken over all piecewise continuously differentiable curves, such that γ_{x_{p_i}, x_{p_j}} = argmin_γ {L(γ) : γ(0) = x_{p_i}, γ(1) = x_{p_j}}. The norm of a tangent vector v ∈ T_{x_p}M is given by ||v|| = sqrt(g_{x_p}(v, v)). Moving from a point x_p ∈ M with initial constant velocity v ∈ T_{x_p}M is formalized by the exponential map exp_{x_p} : T_{x_p}M → M, which gives the position of the geodesic at t = 1, so that exp_{x_p}(v) = γ(1). There is a unique unit-speed geodesic γ which satisfies γ(0) = x_p and γ'(0) = v.
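For concreteness, the exponential map has a simple closed form on the unit hypersphere: exp_{x_p}(v) = cos(||v||) x_p + sin(||v||) v/||v||. The following minimal NumPy sketch (our own illustration, assuming curvature K = 1 and a tangent vector orthogonal to the base point) shows this map in action:

```python
import numpy as np

def exp_map_sphere(x_p, v):
    """Exponential map on the unit hypersphere S^d embedded in R^{d+1}:
    exp_{x_p}(v) = cos(||v||) x_p + sin(||v||) v / ||v||,
    for a tangent vector v satisfying <x_p, v> = 0."""
    n = np.linalg.norm(v)
    if n < 1e-12:          # zero velocity: stay at the base point
        return x_p.copy()
    return np.cos(n) * x_p + np.sin(n) * v / n
```

For example, starting at (1, 0, 0) with tangent velocity of norm π/2 in the second coordinate direction, the geodesic ends a quarter of the way around the sphere, at (0, 1, 0).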
On the other hand, and less relevant to the work at hand, the logarithmic map is the inverse of the exponential map, log_{x_p} = exp_{x_p}^{-1} : M → T_{x_p}M. In geodesically complete Riemannian manifolds, both the exponential and logarithmic maps are well-defined (Needham (1997)). In general, it can be impossible to find a closed-form solution for the geodesic between two points on an arbitrary manifold. As discussed in the main text, we want to be able to move beyond constant curvature spaces and compute the similarity measure φ for latent features which may reside in a more general and learnable manifold. This will enable us to more accurately model data of varying structure, beyond that which can be represented in Euclidean space, and also beyond hierarchical or cyclical data, which can be linked to constant curvature hyperbolic and spherical spaces, respectively. By varying structure we refer to data which may present different underlying patterns in different regions of space. By generating more complex manifolds, we intend to minimize the distortion incurred by the data and to represent the points in a manifold better suited to the downstream task. However, we must still be able to map the Euclidean output features of the network onto our learnable manifold and to compute distances between points. To generate a more complex, learnable manifold while still having closed-form solutions for the exponential map and geodesics, we introduce a product manifold embedding space composed of multiple copies of simple model spaces of constant curvature. Although product manifolds based on constant curvature model spaces still fall under the homogeneous manifold category (Kowalski et al. (1989)), they allow us to model more complex embedding spaces for the dDGM module than those that can be represented using only constant curvature homogeneous spaces. For example, the standard torus, which can be obtained by multiplying two spheres, is a homogeneous manifold.
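The mixed curvature of the standard torus embedded in R^3 can be checked numerically with the textbook formula K(v) = cos(v) / (r(R + r cos v)), where r is the tube radius and R the centre-circle radius (parameter names and defaults are our own illustrative choices):

```python
import numpy as np

def torus_gaussian_curvature(v, R=2.0, r=1.0):
    """Gaussian curvature of a standard torus in R^3 (tube radius r,
    centre-circle radius R > r), parametrized so that v = 0 lies on the
    outer equator and v = pi on the inner one:
        K(v) = cos(v) / (r * (R + r * cos(v)))."""
    return np.cos(v) / (r * (R + r * np.cos(v)))
```

Evaluating at v = 0, π/2, and π recovers the elliptic (K > 0), parabolic (K = 0), and hyperbolic (K < 0) regions discussed next.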
Nevertheless, some points on the manifold have positive Gaussian curvature (the outer part of the torus, elliptic points), some have negative curvature (the inner part, hyperbolic points), and others have zero curvature (parabolic points). This is because the Euler characteristic of the torus (Beltramo et al. (2021)) is zero, so it must always have regions of negative Gaussian curvature to counterbalance the regions of positive curvature guaranteed by Hilbert's theorem. The main point is that using product manifolds of constant curvature spaces we can generate manifolds with regions of different Gaussian curvature, which can help us better represent data structures that may not be purely Euclidean, hyperbolic, or spherical. For any point on the manifold, we can define a normal vector at right angles to the surface. By intersecting normal planes (planes containing the normal vector) with the surface, we can compute normal sections. In general, different normal sections will have different curvatures. We refer to κ_1 and κ_2 as the principal curvatures; they correspond to the maximum and minimum values that the curvature can take at a given point (Kühnel (2005)). The Gaussian curvature K is their product, K = κ_1 κ_2. A Riemannian manifold is said to have constant Gaussian curvature K if sec(P) = K for all two-dimensional linear subspaces P ⊂ T_{x_p}M and for all x_p ∈ M. Manifolds can be classified into three classes depending on their curvature: flat space, positively curved space, and negatively curved space.

B FURTHER DISCUSSION ON CURVATURE LEARNING

Next, we aim to provide a more detailed explanation of curvature learning for product manifolds and the implemented procedure using learnable distance-metric-scaling coefficients. Let us first discuss product manifolds generated solely from Cartesian products of the same model space, such as P_H = ∏_{j=1}^{n_H} H^{d_{H_j}}_{K_{H_j}}, which uses Cartesian products of hyperboloids.
Likewise, taking Cartesian products of hyperspheres we would obtain P_S = ∏_{k=1}^{n_S} S^{d_{S_k}}_{K_{S_k}}. For these cases, although the component distances d^{d_{H_j}}_{H, K_{H_j}}(x^{(j)}_{p_1}, x^{(j)}_{p_2}) and d^{d_{S_k}}_{S, K_{S_k}}(x^{(k)}_{p_1}, x^{(k)}_{p_2}) are in practice all computed using hyperboloids and hyperspheres of the same fixed curvature, K_{H_j} = -1 ∀j and K_{S_k} = 1 ∀k, the scaling coefficients control the curvature of the spaces individually:

d_{P_H}(x_{p_1}, x_{p_2}) = sqrt( Σ_{j=1}^{n_H} ( α_{H_j} d^{d_{H_j}}_{H, K_{H_j}}(x^{(j)}_{p_1}, x^{(j)}_{p_2}) )^2 ),

and

d_{P_S}(x_{p_1}, x_{p_2}) = sqrt( Σ_{k=1}^{n_S} ( α_{S_k} d^{d_{S_k}}_{S, K_{S_k}}(x^{(k)}_{p_1}, x^{(k)}_{p_2}) )^2 ).

That is, α_{H_j} and α_{S_k} scale the distances generated from unit hyperboloids and hyperspheres, respectively. Using the scaled metrics, we are effectively still computing Cartesian products of hyperboloids and hyperspheres with different curvatures, so that P_H = ∏_{j=1}^{n_H} H^{d_{H_j}}_{K_{H_j}} ≠ ∏_{j=1}^{n_H} H^{d_{H_j}}_{K_H} and P_S = ∏_{k=1}^{n_S} S^{d_{S_k}}_{K_{S_k}} ≠ ∏_{k=1}^{n_S} S^{d_{S_k}}_{K_S}. This guarantees we retrieve values equivalent to those which would be generated by model spaces with different curvatures, while avoiding backpropagation through operators such as exponential maps and the closed-form solutions for distances in the different model spaces. Considering again an arbitrary product manifold of all model spaces as in Equation 2, we can control the curvature through the derivatives ∂d_P/∂α_{H_j} for the hyperboloid terms and ∂d_P/∂α_{S_k} for the hyperspheres. In the case of Euclidean space, the curvature is always K_E = 0, so there is no need to learn it. Also, using a Cartesian product of Euclidean spaces is equivalent to using a single Euclidean space of greater dimensionality, P_E = ∏_{i=1}^{n_E} E^{d_{E_i}}_{K_{E_i}} = E^{d_E}_{K_E}, where K_{E_i} = K_E = 0 ∀i and d_E = Σ_{i=1}^{n_E} d_{E_i}. Hence, in Equation 2 we used a single Euclidean space.
The behavior described in Equation 20 can be better appreciated by comparing the distance d_{P_E} obtained using P_E with d^{d_E}_{E, K_E} from E^{d_E}_{K_E}:

d^2_{P_E} = Σ_{i=1}^{n_E} ( d^{d_{E_i}}_{E, K_{E_i}} )^2 = Σ_{i=1}^{n_E} ( d^{d_{E_i}}_{E, K_E} )^2 = ( d^{d_{E_1}}_{E, K_E} )^2 + ( d^{d_{E_2}}_{E, K_E} )^2 + ... + ( d^{d_{E_{n_E}}}_{E, K_E} )^2.

Considering that the distance in each independent Euclidean space is d^{d_{E_i}}_{E, K_E} = sqrt( Σ_a (x^{(a)}_{p_1} - x^{(a)}_{p_2})^2 ), we obtain

d_{P_E} = sqrt( Σ_a (x^{(a)}_{p_1} - x^{(a)}_{p_2})^2 + Σ_b (x^{(b)}_{p_1} - x^{(b)}_{p_2})^2 + ... + Σ_z (x^{(z)}_{p_1} - x^{(z)}_{p_2})^2 ),

which is analogous to taking the squared differences of two points in another Euclidean space with more dimensions, so that d_{P_E} = d^{d_E}_{E, K_E}. This behavior is only applicable to Euclidean space (Gu et al. (2019)): when multiplying other model spaces, P_H = ∏_{j=1}^{n_H} H^{d_{H_j}}_{K_{H_j}} ≠ H^{d_H}_{K_H} and P_S = ∏_{k=1}^{n_S} S^{d_{S_k}}_{K_{S_k}} ≠ S^{d_S}_{K_S}, even when the curvature is the same for all hyperboloids and hyperspheres, K_{H_j} = K_H ∀j and K_{S_k} = K_S ∀k. For example, multiplying two hyperspheres results in a hypertorus, not a higher-dimensional hypersphere. As a final remark, note that any loss function which depends on the distance function induced by the Riemannian metric of a Riemannian manifold is locally smooth on M (where M could be a product manifold P) and can be optimized by first-order methods. In practice, we pass each raw scaling parameter through the sigmoid function S, so that 0 < α^(t) < 1. Using ReLU instead of S would allow α^(t) to take arbitrarily large positive values, but we would have gradient problems if α became negative. Given that the sigmoid function bounds the maximum value that the scaling metrics can take, we must multiply the Euclidean plane distance by its own scaling coefficient as well.
Since the Gumbel Top-k trick selects the closest points to generate unweighted edges, rather than storing the actual geodesics, scaling down the Euclidean plane contribution when necessary is equivalent to having other model spaces with much larger curvature than would otherwise be possible given that the scaling coefficients are bounded between zero and one. For example, for a product manifold based on the Cartesian product of the Euclidean plane and a hypersphere, if the model learns a scaling metric close to zero for the Euclidean plane geodesics, this is equivalent to having a hypersphere with a very large curvature, since edges would only be generated based on the distances calculated on the hypersphere. Finally, note that in practice we use the Adam optimizer instead of simple gradient descent for training.
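Putting the pieces of this appendix together, a minimal NumPy sketch of the scaled product-manifold distance might look as follows. The unit-curvature geodesic formulas are the standard closed forms (as in Gu et al. (2019)); the function names and the sigmoid parametrization of the scaling coefficients mirror the description above, but this is an illustrative sketch, not the released implementation.

```python
import numpy as np

def sphere_dist(x, y):
    """Geodesic on the unit hypersphere: arccos of the ambient inner product."""
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

def hyperboloid_dist(x, y):
    """Geodesic on the unit hyperboloid, via the Minkowski inner product
    <x, y>_L = -x_0 y_0 + sum_i x_i y_i."""
    mink = -x[0] * y[0] + np.dot(x[1:], y[1:])
    return np.arccosh(np.clip(-mink, 1.0, None))

def scaled_product_dist(dists, raw_alphas):
    """d_P = sqrt(sum_i (alpha_i * d_i)^2), with alpha_i = sigmoid(raw_i)
    in (0, 1) acting as the learnable curvature proxy described above."""
    alphas = 1.0 / (1.0 + np.exp(-np.asarray(raw_alphas)))
    d = np.asarray(dists)
    return float(np.sqrt(np.sum((alphas * d) ** 2)))
```

Because only the scaling coefficients (here `raw_alphas`) require gradients, training never backpropagates through the curvature parameter inside the geodesic formulas themselves, which is exactly the point of the reparametrization.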

C.3 COMPUTATIONAL EFFICIENCY

In this section we discuss some of the implementation techniques used to make computation more efficient. We cover two main topics: symbolic handling of distance metrics and graph subsampling. One of the main computational limitations of the approach described in this work is that we must compute distances between all points to generate the latent graph. Although the discrete graph sampling method used by the dDGM is more computationally efficient than its continuous counterpart, the cDGM (because it generates sparse graphs that make convolutional operators lighter), we quickly run into memory problems for graph datasets with O(10^4) nodes. Starting from a pointcloud, we must compute the distances between all points to determine whether a connection should be established. This is problematic, since the computational complexity scales quadratically with the number of nodes in the graph, which can rapidly become intractable as the graph grows. Most of the experiments were performed using NVIDIA Tesla T4 Tensor Core GPUs with 16 GB of GDDR6 memory, NVIDIA P100 GPUs with 16 GB of CoWoS HBM2 memory, or NVIDIA Tesla K80 GPUs with 24 GB of GDDR5 memory. All these GPUs have limited memory that is easily exceeded during backpropagation for datasets other than Cora and CiteSeer (Yang et al. (2016); Lu & Getoor (2003); Sen et al. (2008)), which have 2,708 and 3,327 nodes, respectively. For example, using the standard PyTorch Geometric implementation we are not able to backpropagate for the PubMed dataset, which has 19,717 nodes.

C.3.1 SYMBOLIC HANDLING OF DISTANCE METRICS

To avoid memory overflows we resort to Kernel Operations (KeOps) (Charlier et al. (2021)), which makes it possible to compute reductions of large arrays whose entries are given by a mathematical formula. We can classify matrices into three categories: dense, sparse, and symbolic. Dense matrices are dense numerical arrays, M_ij = M[i, j], which put a heavy load on computer memory: as they grow, they can struggle to fit into RAM or GPU memory. This is what happens in our case when calculating distances between points. Sparse matrices are typically used to address this problem. They use lists of indices (i_n, j_n) and associated values M_n; hence, only the non-zero entries are stored. The main limitation of this approach is that the speedup obtained from sparse matrices is highly dependent on sparsity. To obtain significant performance improvements with sparse encoding, the original matrix should effectively be more than 99% empty (Charlier et al. (2021)), which significantly constrains its applicability. Alternatively, KeOps uses symbolic matrices to represent matrices which can be summarized by an underlying common mathematical structure. In this setup, the matrix entries M_ij are represented as a function of vectors x_i and x_j, so that M_ij = F(x_i, x_j). Even if these objects are not necessarily sparse, they can be represented using only the small data arrays x_i and x_j, which can result in large improvements in computational efficiency while avoiding memory overflow. In our case, the matrices we work with are fully populated, and using sparse tensors would not enhance performance. On the other hand, the symbolic approach implemented using KeOps enables us to represent the original dense matrix containing all pairwise distances in a much more compact way, and to work with graphs larger than Cora and CiteSeer (Yang et al. (2016); Lu & Getoor (2003); Sen et al. (2008)), namely PubMed, CS, and Physics (Shchur et al. (2018)). Finally, note that in our case the function used by the symbolic matrix is the distance metric appropriate for whichever manifold we are using to construct the latent graph: M_ij = d(x_i, x_j). (34)
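The core benefit of symbolic handling can be imitated on CPU with a simple row-block reduction: the full N × N distance matrix is never stored; one block of rows at a time is generated from the formula M_ij = d(x_i, x_j) and immediately reduced. The sketch below is our own NumPy illustration of that idea for Euclidean k-nearest neighbours, not KeOps itself (which additionally compiles the formula and streams the reduction on GPU).

```python
import numpy as np

def knn_blockwise(x, k, block=256):
    """Row-block k-NN: each (block x N) slab of distances is generated on
    the fly from the formula d(x_i, x_j) = ||x_i - x_j|| and reduced
    immediately, so the dense N x N matrix never exists in memory."""
    n = x.shape[0]
    out = np.empty((n, k), dtype=np.int64)
    for s in range(0, n, block):
        xb = x[s:s + block]                                           # (B, D)
        d = np.linalg.norm(xb[:, None, :] - x[None, :, :], axis=-1)   # (B, N)
        rows = np.arange(d.shape[0])
        d[rows, s + rows] = np.inf        # exclude self-distances
        out[s:s + d.shape[0]] = np.argsort(d, axis=1)[:, :k]
    return out
```

Peak memory drops from O(N^2) to O(block · N), while the result matches the dense computation exactly, mirroring the accuracy-preserving property of symbolic matrices noted in Section C.3.2.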

C.3.2 QUANTITATIVE STUDY OF RUNTIME SPEEDUP

We quantify the runtime speedup obtained by using symbolic handling of the distance metrics when generating latent graphs for Cora and CiteSeer. Note, however, that improved execution time is not the only benefit of symbolic handling. As discussed in Section C.3.1, symbolic matrices also help avoid memory overflow for larger graphs. Indeed, this is the reason we choose Cora and CiteSeer for these experiments: on other datasets we experience GPU memory overflow when using dense matrices, and hence cannot compare dense to symbolic matrix performance. It should also be highlighted that dense and symbolic matrices give the same results in terms of accuracy, since they are mathematically equivalent; the difference lies in the computational efficiency of each method. We display the results in Table 5 and Table 6, in which we record execution times for 100 epochs using an NVIDIA P100 GPU. We evaluate the effect of increasing the number of dDGM layers on the runtime. We also compare the dDGM module using Euclidean space (GCN-dDGM-E) and a product manifold of Euclidean, hyperbolic, and spherical spaces (GCN-dDGM-EHS) to generate the latent graphs. Table 5: Results for runtime speedup quantification using symbolic as compared to dense matrices. These results are for the GCN-dDGM-E model using k = 3 and training for 100 epochs on an NVIDIA P100 GPU. We can see that as we add more dDGMs, the difference between using dense and symbolic matrices becomes more substantial. Likewise, the benefit from using symbolic matrices becomes more apparent when using product manifolds; this is because more distances must be computed. The product manifolds are calculated based on the Cartesian product of three model spaces, and hence, to obtain the overall distance between each pair of nodes, we must compute geodesics in all constant curvature manifolds independently.
Another observation is that the computation time for CiteSeer increases substantially compared to Cora when using dense matrices. CiteSeer only has 995 more nodes than Cora, yet with 3 dDGM-EHS layers the execution time for 100 epochs increases by 5.53 seconds for dense matrices. This clearly shows that dense matrices can quickly become hard to scale to larger graphs, which can have orders of magnitude more nodes than CiteSeer. In line with the literature (Kazi et al. (2022)), using symbolic matrices is more computationally tractable for larger graphs. We also run additional experiments to quantify the increase in computation time as a function of k, that is, the number of edges per latent graph node when applying the Gumbel Top-k trick. As we can see in Table 7 and Table 8, k does not seem to have a statistically significant impact on the execution time. As before, we find that using symbolic matrices is consistently more efficient. Table 7: Results for runtime speedup quantification using symbolic as compared to dense matrices for different k = 1 to 30. These results are for the GCN-dDGM-E model trained for 100 epochs using a Tesla T4 GPU.

C.3.3 TRAINING ON LARGE GRAPHS

Although symbolic handling of distance metrics is certainly necessary, for larger graphs such as the node property prediction graphs of the Open Graph Benchmark (OGB) (Hu et al. (2020)), which have O(10^5) to O(10^8) nodes and O(10^6) to O(10^9) edges, we must combine KeOps with graph subsampling techniques to make backpropagation computationally tractable. We apply a neighbor sampler to track message passing dependencies for the subsampled nodes, which makes computation more lightweight. Based on the message passing equation x_i^{(l+1)} = φ( x_i^{(l)}, ⊕_{j∈N(v_i)} ψ(x_i^{(l)}, x_j^{(l)}) ), to calculate x_i^{(l+1)} we must aggregate, and hence have stored, the node features of its neighbors when subsampling the graph. Note that unlike the original node prediction setup, in which we input all the nodes and predict properties for all nodes in the complete graph, here we have a different number of input and output nodes, which gives rise to a bipartite structure for multi-layer minibatch message passing. Such a bipartite graph, which samples only the necessary input and output nodes from the original graph, is called a message flow graph (Ladkin & Leue (2005)). For every node computed in a given batch, we must track its message flow graph alongside all relevant dependencies.
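A minimal sketch of the layer-wise neighbor sampling just described might look like the following. This is our own simplified illustration (function name, adjacency-dict representation, and fanout parameter are ours), not the PyTorch Geometric sampler used in the experiments.

```python
import random

def sample_message_flow(adj, seeds, fanout, num_layers, seed=0):
    """Layer-wise neighbour sampling: starting from the output (seed) nodes,
    sample up to `fanout` neighbours per node per layer. Each returned block
    is a bipartite mapping {target: sampled sources}, i.e. one layer of the
    message flow graph tracked for minibatch message passing."""
    rng = random.Random(seed)
    blocks, frontier = [], sorted(set(seeds))
    for _ in range(num_layers):
        block = {}
        nxt = set(frontier)
        for v in frontier:
            nbrs = adj.get(v, [])
            block[v] = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            nxt.update(block[v])
        blocks.append(block)
        frontier = sorted(nxt)    # sampled sources become next layer's targets
    return blocks
```

The frontier grows by at most `fanout` nodes per target per layer, which is what keeps each minibatch lightweight compared to materializing the full receptive field.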

D THE AEROTHERMODYNAMICS DATASET

We generate a multi-class classification dataset based on Computational Fluid Dynamics (CFD) simulations conducted for the rocket designs developed for the Karman Space Programme. Specifically, the dataset is generated from the shock wave velocity distribution around the nose of a rocket at 10 degrees angle of attack. The simulation meshgrid has varying degrees of resolution, and the shock wave is not symmetric due to the angle of attack. Although CFD software uses a meshgrid to discretize space and run the aerothermodynamics simulations, it can be challenging to extract the connectivity of the original graph, since the software is most often designed to output only a pointcloud. Moreover, given that across a shock wave the static pressure, temperature, and gas density increase almost instantaneously and there is an abrupt decrease in the flow area, the original graph can present high heterophily. Hence, latent graph inference can be beneficial. For the dataset, we only consider the shock region at the leading edge of the rocket. To obtain the class labels, we separate the absolute shock flow velocity into 4 regions: ≤ 300 m/s, 300-450 m/s, 450-650 m/s, and > 650 m/s. The original simulation has 207,745 datapoints, but we only focus on the shock around the nose of the rocket and apply graph coarsening, see Figure 4. The network input is a pointcloud with pressure values. Note that the coordinates with respect to the rocket are not given to the network. The model is tasked with classifying the shock into four intensity regions based on the absolute velocity distribution. Table 9 summarizes the dataset.
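The labeling rule just described can be sketched as follows; the function name and the class indices 0 to 3 are our own illustrative conventions, not taken from the dataset-generation code.

```python
def shock_class(velocity_ms):
    """Map absolute shock-flow velocity (m/s) to the four intensity classes
    used as labels: <= 300, 300-450, 450-650, and > 650 m/s."""
    if velocity_ms <= 300.0:
        return 0
    if velocity_ms <= 450.0:
        return 1
    if velocity_ms <= 650.0:
        return 2
    return 3
```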

E ADDITIONAL EXPERIMENTS AND RESULTS

In this appendix we include additional experiments for the latent graph inference system. In Section E.1 we explore using more than one latent graph inference system per GNN model. In Section E.2 we include additional results for the homophilic graph datasets in which we vary the value of k for the graph generation algorithm. In Section E.3 we experiment with using a greater number of model spaces to generate product manifolds for the embedding space. Section E.4 includes results for heterophilic datasets, Section E.5 for OGB, and Section E.6 for inductive learning. The first-, second-, and third-best models for each dataset are highlighted in each table.

E.1 NUMBER OF DDGM MODULES

In these experiments, we investigate whether there is any benefit in stacking multiple dDGM modules. Using multiple dDGMs effectively means that, within the model, the network layers learn based on different latent graphs. In principle, according to the results in Tables 10 and 11, there is no clear improvement in performance for Cora and CiteSeer. In fact, the accuracy of the models can decrease. Only for PubMed is there sometimes some improvement in accuracy. These results align with previous studies by Kazi et al. (2022). Finally, we also run a few experiments for Physics and CS. Again, we find no substantial improvement using an additional dDGM module, see Table 12. Given our findings, we use a single dDGM module, since it is more computationally efficient.

E.2 HOMOPHILIC GRAPH DATASETS EXTENDED RESULTS

In this section we include extended results for the Cora, CiteSeer, PubMed, Physics, and CS datasets. Table 13 presents results using single model spaces for the embedding space, and Table 14 and Table 15 using product manifolds. As discussed in the main text, the dDGM* does not use the original dataset graph as inductive bias, since it is only provided with a pointcloud. On the other hand, the dDGM does use the original graph. Different k values are applied.

E.3 MORE COMPLEX PRODUCT MANIFOLDS RESULTS

The main objective of this section is to test the limits of our approach. In Table 16 we display the results of multiplying up to five model spaces for the CS dataset. From the results, we can see that composing more model spaces into the product manifold can result in improved performance. The GCN-dDGM-EHHSS network with k = 5 obtains an accuracy of 93.10 ± 0.74%, as compared to the best single model space based model, GCN-dDGM-E with k = 5, which achieves a result of only 87.88 ± 2.55%.

E.4 HETEROPHILIC GRAPH DATASETS EXTENDED RESULTS

In Table 17 we display the results using both the dDGM and the dDGM* module for heterophilic datasets. The dDGM module uses the original dataset graph as inductive bias for the generation of the latent graphs. As expected, this leads to worse results than those obtained with the dDGM* on heterophilic graphs, since the original graph is not well suited for diffusion with GCNs. It is better to start directly from a pointcloud and completely ignore the original adjacency matrix A^(0), since it does not provide the model with a good inductive bias.

E.5 RESULTS FOR LARGE GRAPHS FROM THE OPEN GRAPH BENCHMARK

Table 18 and Table 19 display results for the OGB-Arxiv and OGB-Products datasets. Although the OGB-Products dataset is considerably larger than OGB-Arxiv in terms of both the number of nodes and edges, our models achieve slightly better performance on it. This may be because the OGB-Products dataset is an undirected graph, whereas the OGB-Arxiv dataset is directed. The dDGM module generates undirected edges between nodes by construction (if we neglect the noise in the Gumbel Top-k trick), which may be affecting performance on the OGB-Arxiv dataset.

F LATENT GRAPH LEARNING PLOTS

In this appendix we include additional plots for the learned latent graphs for the heterophilic datasets discussed in the main text as well as for the TadPole and the Aerothermodynamics dataset.

F.1 LEARNED LATENT GRAPHS FOR HETEROPHILIC DATASETS

In Figures 5, 6, 7, and 8, we display the original graphs provided by the heterophilic datasets and compare them to the latent graphs generated by the dDGM modules. From the plots we can clearly see the high homophily level of the Texas latent graph in Figure 5, for which we obtain four distinct clusters. In the case of the bigger datasets in Figures 7 and 8, the algorithm is still able to create clusters, but there is mixing between classes. Note that the generated latent graphs depend on both the downstream task and the diffusion layers we are using, that is, the GCNs. Since we have created a fully-differentiable system, the models optimize the latent graph generation together with the rest of the model parameters. Hence, if we were to change the downstream task or the diffusion layers, we would expect different latent graphs. In Figure 2 from Section 4.1 we showed the latent graph learning evolution plots as a function of training epochs for the Texas dataset. Here we provide additional plots for other datasets, which show that the latent graph inference system is applicable across a wide range of datasets and that it learns to organize the connectivity of the latent graph during training.
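The homophily level h reported alongside these figures can be computed with the standard edge-homophily measure, sketched below. We assume h denotes edge homophily (the fraction of edges joining same-class endpoints); the function is our own illustration.

```python
def edge_homophily(edges, labels):
    """Edge homophily h: the fraction of edges whose two endpoints share a
    class label. `edges` is an iterable of (i, j) pairs and `labels` maps
    node index to class."""
    edges = list(edges)
    if not edges:
        return 0.0
    return sum(labels[i] == labels[j] for i, j in edges) / len(edges)
```

On the original Texas graph this measure is low (h = 0.11 in Figure 5), while on the learned latent graph it rises to 0.93, which is what the evolution plots track epoch by epoch.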






Figure 1: Diagram depicting the mapping procedure from the GNN Euclidean output to the latent manifold P. The appropriate exponential map is used to map the points from the tangent plane to the manifold. We construct the latent graph based on the distances on the learned manifold.

(a) Epoch 1, h " 0.39. (b) Epoch 5, h " 0.46. (c) Epoch 10, h " 0.50. (d) Epoch 100, h " 0.74. (e) Epoch 500, h " 0.81. (f) Epoch 1000, h " 0.93.

Figure 2: Latent graph homophily level, h, evolution as a function of training epochs for Texas. The latent graphs are produced during the training process for the GCN-dDGM*-EH model with k = 2.

Figure 4: Pointcloud plot of shock intensity regions. The different regions are represented using different colors which correspond to each target class. This is the dataset after applying graph coarsening.

(a) Original graph, h = 0.11. (b) Learned latent graph, h = 0.93.

Figure 5: Texas, original vs learned latent graph. The latent graph displayed was learned using the GCN-dDGM*-EH model with k = 2 in the Gumbel Top-k trick.

Figure 6: Wisconsin, original vs learned latent graph. The latent graph displayed was learned using the GCN-dDGM*-EHS model with k = 10 in the Gumbel Top-k trick.

Figure 7: Squirrel, original vs learned latent graph. The latent graph displayed was learned using the GCN-dDGM*-S model with k = 3 in the Gumbel Top-k trick. Note that both graphs (a) and (b) have the same number of nodes, but in (a) nodes are displayed more closely packed due to the connectivity structure of the graph.

Figure 8: Chameleon, original vs learned latent graph. The latent graph displayed was learned using the GCN-dDGM*-E model with k = 3 in the Gumbel Top-k trick. Note that both graphs (a) and (b) have the same number of nodes, but in (a) nodes are displayed more closely packed due to the connectivity structure of the graph.

(a) Epoch 1, h " 0.34. (b) Epoch 5, h " 0.43. (c) Epoch 10, h " 0.46. (d) Epoch 100, h " 0.52. (e) Epoch 500, h " 0.62. (f) Epoch 1000, h " 0.64.

Figure 9: Latent graph homophily level, h, evolution as a function of training epochs for Wisconsin. The latent graphs are produced during the training process for the GCN-dDGM*-EHS model with k = 10.

(a) Epoch 1, h " 0.20. (b) Epoch 100, h " 0.26. (c) Epoch 500, h " 0.31. (d) Epoch 1000, h " 0.32.

Figure 10: Latent graph homophily level, h, evolution as a function of training epochs for Squirrel using the GCN-dDGM*-S model with k = 3. Recall that each node in the graph is represented with a point and each class is assigned a different color. In (a) there is no structure; after 1,000 epochs in (d) the algorithm has been able to organize the graph structure to separate some of the classes, but there is still a substantial amount of mixing.

(a) Epoch 1, h " 0.19. (b) Epoch 100, h " 0.36. (c) Epoch 500, h " 0.40. (d) Epoch 1000, h " 0.42.

Figure 11: Latent graph homophily level, h, evolution as a function of training epochs for Chameleon using the GCN-dDGM*-ES model with k = 5.

Epoch 1, h " 0.44. (b) Epoch 5, h " 0.42. (c) Epoch 10, h " 0.46. (d) Epoch 100, h " 0.72. (e) Epoch 400, h " 0.88. (f) Epoch 800, h " 0.92.

Figure 14: Latent graph homophily level, h, evolution as a function of training epochs for TadPole. The latent graphs shown here were obtained using the GCN-dDGM*-H model with k = 3.

Epoch 1, h " 0.43. (b) Epoch 10, h " 0.46. (c) Epoch 100, h " 0.65. (d) Epoch 800, h " 0.90.

Figure 15: Latent graph homophily level, h, evolution as a function of training epochs for TadPole obtained using the GCN-dDGM*-EHS model with k = 7. Nodes with different colors correspond to different target classes. Starting from an unstructured graph in (a), the algorithm is able to generate a highly homophilic graph after 800 epochs in (d).

(a) Epoch 1, h " 0.35. (b) Epoch 10, h " 0.87. (c) Epoch 500, h " 0.86. (d) Epoch 1000, h " 0.88.

Figure 16: Latent graph homophily level, h, evolution as a function of training epochs for Aerothermodynamics. The latent graphs shown here were obtained using the GAT-dDGM*-EHH model with k = 7.

Figure 17: Latent graph homophily level, h, evolution as a function of training epochs for Aerothermodynamics. The latent graphs shown here were obtained using the GAT-dDGM*-H model with k = 7.
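The homophily level h reported in the captions above measures how often edges connect nodes of the same class. Assuming the standard edge-homophily definition (the fraction of edges whose endpoints share a label), it can be computed as:

```python
import numpy as np

def edge_homophily(edges, labels):
    """Edge homophily: fraction of edges (u, v) with labels[u] == labels[v].
    edges is an (E, 2) integer array, labels an (N,) integer array."""
    src, dst = edges[:, 0], edges[:, 1]
    return float(np.mean(labels[src] == labels[dst]))

labels = np.array([0, 0, 1, 1])
edges = np.array([[0, 1], [0, 2], [2, 3]])  # two same-class edges, one cross-class
h = edge_homophily(edges, labels)           # 2/3
```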

Relevant operators (exponential maps and distances between two points) in Euclidean, hyperbolic, and spherical spaces with arbitrary constant curvatures.

For the hyperboloid of curvature K_H < 0, using as reference point the origin o_H^(K_H) = (1/√(−K_H), 0, …, 0), the exponential map reads exp_{o_H}^{K_H}(x) = ( cosh(√(−K_H) ||x||) / √(−K_H), sinh(√(−K_H) ||x||) x / (√(−K_H) ||x||) ). Similarly for the hypersphere of curvature K_S > 0, using as reference point o_S^(K_S) = (1/√(K_S), 0, …, 0), we have exp_{o_S}^{K_S}(x) = ( cos(√(K_S) ||x||) / √(K_S), sin(√(K_S) ||x||) x / (√(K_S) ||x||) ).
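These operators translate directly into code. The sketch below is ours; it assumes the standard hyperboloid and hypersphere conventions (Lorentz inner product with signature (−, +, …, +)), implements the exponential maps at the origin together with the corresponding geodesic distances, and combines per-factor distances into a product-manifold distance by summing their squares:

```python
import numpy as np

def exp_map_hyperboloid(x, K):
    """Exponential map at the origin o = (1/sqrt(-K), 0, ..., 0) of the
    hyperboloid model with constant curvature K < 0; x is a tangent
    vector in R^d, the output lives in R^(d+1)."""
    sk = np.sqrt(-K)
    n = np.linalg.norm(x)
    if n == 0.0:
        return np.concatenate(([1.0 / sk], np.zeros_like(x)))
    return np.concatenate(([np.cosh(sk * n) / sk],
                           np.sinh(sk * n) * x / (sk * n)))

def exp_map_sphere(x, K):
    """Exponential map at o = (1/sqrt(K), 0, ..., 0) of the hypersphere
    with constant curvature K > 0."""
    sk = np.sqrt(K)
    n = np.linalg.norm(x)
    if n == 0.0:
        return np.concatenate(([1.0 / sk], np.zeros_like(x)))
    return np.concatenate(([np.cos(sk * n) / sk],
                           np.sin(sk * n) * x / (sk * n)))

def dist_hyperboloid(u, v, K):
    """Geodesic distance on the hyperboloid, via the Lorentz inner
    product with signature (-, +, ..., +)."""
    lorentz = -u[0] * v[0] + np.dot(u[1:], v[1:])
    return float(np.arccosh(np.clip(K * lorentz, 1.0, None)) / np.sqrt(-K))

def dist_sphere(u, v, K):
    """Geodesic (great-circle) distance on the hypersphere."""
    return float(np.arccos(np.clip(K * np.dot(u, v), -1.0, 1.0)) / np.sqrt(K))

# Product-manifold distance: per-factor geodesic distances combined by
# summing their squares (here one hyperbolic and one spherical factor).
x_t, y_t = np.array([0.3, -0.1]), np.array([-0.2, 0.4])  # tangent vectors
u_h, v_h = exp_map_hyperboloid(x_t, -1.0), exp_map_hyperboloid(y_t, -1.0)
u_s, v_s = exp_map_sphere(x_t, 1.0), exp_map_sphere(y_t, 1.0)
d_prod = np.sqrt(dist_hyperboloid(u_h, v_h, -1.0) ** 2
                 + dist_sphere(u_s, v_s, 1.0) ** 2)
```

A useful sanity check is that the distance between exp(x) and the origin equals ||x|| for both factors, which the formulas above satisfy by construction.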

Wieder, Stefan Kohlbacher, Mélaine Kuenemann, Arthur Garon, Pierre Ducrot, Thomas Seidel, and Thierry Langer. A compact review of molecular property prediction with graph neural networks. Drug Discovery Today: Technologies, 37:1-12, 2020. ISSN 1740-6749.

Huijun Wu, Chen Wang, Yu. O. Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. Adversarial examples for graph data: Deep insights into attack and defense. In International Joint Conference on Artificial Intelligence, 2019.

Luhuan Wu, Andrew Miller, Lauren Anderson, Geoff Pleiss, David M. Blei, and John P. Cunningham. Hierarchical inducing point Gaussian process for inter-domain observations. In AISTATS, 2021.

Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings, 2016.

Zaixin Zhang, Qi Liu, Hao Wang, Chengqiang Lu, and Chee-Kong Lee. Motif-based graph self-supervised learning for molecular property prediction. ArXiv, 2021.

Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. arXiv: Learning, 2020.

Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. Deep graph structure learning for robust representations: A survey. ArXiv, abs/2103.03036, 2021.

Results for runtime speedup quantification using symbolic as compared to dense matrices. These results are for the GCN-dDGM-EHS model using k = 3 and training for 100 epochs on an NVIDIA P100 GPU.

Results for runtime speedup quantification using symbolic as compared to dense matrices for different values of k = 1-30. These results are for the GCN-dDGM-EHS model, training for 100 epochs on a Tesla T4 GPU.
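The speedup above comes from never materializing the dense N × N distance matrix: symbolic matrices (for example, as provided by the KeOps library) evaluate distance reductions lazily. The NumPy sketch below mimics that memory behaviour with simple row chunking; it illustrates the idea, and is not the implementation used in the paper:

```python
import numpy as np

def knn_chunked(x, k, chunk=1024):
    """Indices of the k nearest neighbours (including the point itself)
    for each row of x, computed chunk-by-chunk so the full N x N distance
    matrix is never materialized -- the same memory reduction that
    symbolic (lazily evaluated) matrices provide."""
    n = x.shape[0]
    out = np.empty((n, k), dtype=np.int64)
    sq = (x ** 2).sum(axis=1)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # squared Euclidean distances for this chunk of rows only
        d = sq[start:stop, None] - 2.0 * x[start:stop] @ x.T + sq[None, :]
        out[start:stop] = np.argpartition(d, k, axis=1)[:, :k]
    return out

rng = np.random.default_rng(1)
points = rng.normal(size=(50, 4))
nbrs = knn_chunked(points, k=5, chunk=16)
```

Peak memory is O(chunk × N) instead of O(N²), which is the reduction that makes k-NN graph construction feasible on large graphs.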

Summary of properties for the Aerothermodynamics dataset.

Results for Cora, CiteSeer, and PubMed using more than one dDGM* (taking a pointcloud as input) latent graph inference module.

Results for Cora, CiteSeer, and PubMed using more than one dDGM (leveraging the original graph connectivity structure) latent graph inference module.

Results for Physics and CS using two dDGM latent graph inference modules with a product manifold combining Euclidean, hyperbolic, and spherical model spaces.

Results for classical homophilic datasets combining GCN diffusion layers with the dDGM* and dDGM latent graph inference systems and using single model spaces to construct the latent graphs.

Results for classical homophilic datasets combining GCN diffusion layers with the dDGM* module and using product manifolds to construct the latent graphs.

Results for classical homophilic datasets combining GCN diffusion layers with the dDGM module and using product manifolds to construct the latent graphs.

Results using more complex product manifolds for the CS dataset. We multiply up to five model spaces to generate the product manifolds.

Results for heterophilic datasets combining GCN diffusion layers with the dDGM* latent graph inference system. We display results using model spaces as well as product manifolds to construct the latent graphs.

Results for OGB-Arxiv dataset using GATv2 diffusion layers and different latent graph inference modules.

Results for OGB-Products dataset using GATv2 diffusion layers and different latent graph inference modules.

INDUCTIVE LEARNING: RESULTS FOR THE QM9 AND ALCHEMY DATASETS

The datasets discussed in the main text were solely concerned with transductive learning. For completeness, we show that the latent graph inference system based on product manifolds is also applicable to inductive learning. Molecules are naturally represented as graphs, with atoms as nodes and bonds as edges. Prediction of molecular properties is a popular application of GNNs in chemistry (Wieder et al. (2020); Stark et al. (2021); Li et al. (2021); Godwin et al. (2021); Zhang et al. (2021)). Specifically, we work with the QM9 (Ramakrishnan et al. (2014); Ruddigkeit et al. (2012)) and Alchemy (Morris et al. (2020)) datasets, which are well known in the Geometric Deep Learning literature. Table 20 displays the results. These tasks are substantially different from the ones previously discussed because they involve inductive learning and regression, whereas before all tasks focused on transductive learning and multi-class classification.

Results for the QM9 and Alchemy datasets using the dDGM module.

C TRAINING PROCEDURE

In this appendix we discuss key concepts for training the dDGM module. We detail how to backpropagate through the discrete sampling method and introduce an additional loss term to make this possible. We also provide additional implementation details for updating learnable distance-metric-scaling coefficients and dealing with distance functions during training. Lastly, we describe the two approaches used in this work to make training computationally tractable for larger graphs: symbolic handling of distance metrics and graph subsampling.

C.1 BACKPROPAGATION THROUGH THE DDGM

The baseline node feature learning part of the architecture is optimized based on the downstream task loss: for classification we use the cross-entropy loss, and for regression the mean squared error loss. Nevertheless, we must also update the graph-learning dDGM parameters. To do so, we follow the approach proposed by Kazi et al. (2022) and apply a compound loss that rewards edges involved in a correct classification and penalizes edges which result in misclassification. We define the reward function

δ(y_i, ŷ_i) = E(ac_i) − ac_i, (28)

as the difference between the average accuracy of the ith sample and the current prediction accuracy, where y_i and ŷ_i are the predicted and true labels, and ac_i = 1 if y_i = ŷ_i and 0 otherwise. Based on δ(y_i, ŷ_i), we obtain the loss employed to update the graph-learning module,

L_GL = Σ_i δ(y_i, ŷ_i) Σ_{l=1}^{L} Σ_{(j,k) ∈ E^(l)} log p^(l)_{jk},

whose gradient approximates the gradient of the expectation E_{(G^(1), …, G^(L)) ∼ (P^(1), …, P^(L))} [Σ_i δ(y_i, ŷ_i)] with respect to the parameters θ_GL of the graphs in all the layers. The expectation E(ac_i)^(t) is calculated with the exponential moving average

E(ac_i)^(t) = β E(ac_i)^(t−1) + (1 − β) ac_i^(t),

with β = 0.9 and E(ac_i)^(t=0) = 0.5. For regression we use the R² score instead of the accuracy.
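A simplified, single-layer sketch of this compound loss (variable names are ours; the per-sample log-probabilities of the sampled edges would come from the Gumbel Top-k sampling step):

```python
import numpy as np

def ddgm_reward(correct, expected_acc):
    """delta(y_i, yhat_i) = E(ac_i) - ac_i: negative for correctly
    classified samples (their sampled edges get rewarded), positive
    otherwise (their edges get penalized)."""
    return expected_acc - correct.astype(float)

def graph_learning_loss(delta, log_probs):
    """L_GL = sum_i delta_i * (sum of log-probabilities of the edges
    sampled for sample i); log_probs[i] holds those per-edge values."""
    return float(sum(d * lp.sum() for d, lp in zip(delta, log_probs)))

def update_expected_acc(expected_acc, correct, beta=0.9):
    """Exponential moving average of per-sample accuracy:
    E(ac_i)^(t) = beta * E(ac_i)^(t-1) + (1 - beta) * ac_i^(t)."""
    return beta * expected_acc + (1.0 - beta) * correct.astype(float)

correct = np.array([1, 0, 1], dtype=bool)       # per-sample prediction correctness
exp_acc = np.full(3, 0.5)                       # E(ac_i)^(t=0) = 0.5
delta = ddgm_reward(correct, exp_acc)           # [-0.5, 0.5, -0.5]
log_probs = [np.log(np.array([0.8, 0.6])) for _ in range(3)]  # toy edge probs
loss = graph_learning_loss(delta, log_probs)
exp_acc = update_expected_acc(exp_acc, correct)
```

Minimizing the loss increases the log-probability of edges attached to correctly classified samples (negative delta) and decreases it for misclassified ones.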

C.2 UPDATING LEARNABLE DISTANCE-METRIC-SCALING COEFFICIENTS

The scaling metrics are learnable, so that we can indirectly adjust the curvature of each model space without having to backpropagate through the exponential map functions and the distance formulas. If we simply take the derivative of the graph loss and update the distance-metric-scaling coefficients during training,

α^(t) = α^(t−1) − lr ∂L_GL/∂α^(t−1),

(lr being the learning rate), this can result in negative values for the coefficients that multiply the distances in the different model spaces, which would be mathematically incorrect since distances are by definition positive or zero. To solve this issue we learn ᾱ instead of α. The two are related by α = S(ᾱ), where S is the sigmoid function, so that

α^(t) = S(ᾱ^(t)) = S(ᾱ^(t−1) − lr ∂L_GL/∂ᾱ^(t−1)),

which guarantees that the scaling coefficients always remain in (0, 1).

Published as a conference paper at ICLR 2023

Lastly, we display the learned latent graphs obtained using the original heterophilic graph as inductive bias for both the Texas and Wisconsin datasets. In Figure 12 and Figure 13, we compare the final inferred latent graphs for these datasets when the original heterophilic graph is used as inductive bias against starting from a pointcloud. Models that use the dDGM module, which makes use of the original graph, are not able to achieve homophily levels as high as those of models using the dDGM* module, which ignores the original graph. This explains the difference in performance in Table 17 from Appendix E.4.
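A minimal sketch of the sigmoid reparameterization of the scaling coefficients described above (the quadratic loss and its target value 0.25 are a hypothetical stand-in for ∂L_GL/∂α; only the sigmoid trick itself comes from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# We optimise the unconstrained parameter alpha_bar; the coefficient that
# actually scales each model-space distance is alpha = sigmoid(alpha_bar),
# which always lies in (0, 1) and therefore can never make a scaled
# distance negative.
alpha_bar, lr = 0.0, 0.1
for _ in range(100):
    alpha = sigmoid(alpha_bar)
    # toy quadratic loss (alpha - 0.25)^2 standing in for the graph loss;
    # the chain rule supplies sigmoid'(alpha_bar) = alpha * (1 - alpha)
    grad_alpha = 2.0 * (alpha - 0.25)
    alpha_bar -= lr * grad_alpha * alpha * (1.0 - alpha)
alpha = sigmoid(alpha_bar)
```

Gradient descent is performed entirely on alpha_bar; the constraint on alpha is enforced by construction rather than by clipping.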

G MODEL ARCHITECTURES

For reproducibility, in this appendix we include a summary of all the models used to obtain the results discussed in this paper. This includes the GNN network architectures as well as the internal structure of the dDGM modules.

G.1 NETWORK ARCHITECTURES

In this section we provide the neural network architectures used for the experimental validation as well as other training specifications.

G.1.1 NETWORKS FOR CLASSICAL GRAPH DATASETS

All neural network models used for homophilic graph datasets (MLP, GCN, and GCN-dDGMs) follow the architecture depicted in Table 21. We apply a learning rate of lr = 10^-2 and a weight decay of wd = 10^-4. Models are trained for about 1,500 epochs.

G.1.2 NETWORKS FOR HETEROPHILIC GRAPH DATASETS

The networks used for heterophilic datasets follow the architecture summarized in Table 22. We apply lr = 10^-2 and wd = 10^-3, and we train the models for about 1,000 epochs.

G.1.3 NETWORKS FOR OGB-ARXIV

Table 23 shows the networks used for the OGB-Arxiv dataset. We use lr = 10^-3, wd = 0, and train for 100 epochs. For graph subsampling, we sample up to 1,000 neighbors per node and use a batch size of 1,000.

G.1.4 NETWORKS FOR OGB-PRODUCTS

Table 24 shows the networks used for the OGB-Products dataset. We use lr = 10^-2, wd = 0, and train for 30 epochs. For graph subsampling, we sample up to 200 neighbors per node and use a batch size of 1,000.

G.1.5 NETWORKS FOR THE QM9 AND ALCHEMY DATASETS

We use wd = 0, and train for about 1,000 epochs.

