WASSERSTEIN EMBEDDING FOR GRAPH LEARNING

Abstract

We present Wasserstein Embedding for Graph Learning (WEGL), a novel and fast framework for embedding entire graphs in a vector space, in which various machine learning models are applicable for graph-level prediction tasks. We leverage new insights on defining similarity between graphs as a function of the similarity between their node embedding distributions. Specifically, we use the Wasserstein distance to measure the dissimilarity between node embeddings of different graphs. Unlike prior work, we avoid pairwise calculation of distances between graphs and reduce the computational complexity from quadratic to linear in the number of graphs. WEGL calculates Monge maps from a reference distribution to each node embedding and, based on these maps, creates a fixed-sized vector representation of the graph. We evaluate our new graph embedding approach on various benchmark graph-property prediction tasks, showing state-of-the-art classification performance while having superior computational efficiency. The code is available at https://github.com/navid-naderi/WEGL.

1. INTRODUCTION

Many exciting and practical machine learning applications involve learning from graph-structured data. While images, videos, and temporal signals (e.g., audio or biometrics) are instances of data that are supported on grid-like structures, data in social networks, cyber-physical systems, communication networks, chemistry, and bioinformatics often live on irregular structures (Backstrom & Leskovec, 2011; Sadreazami et al., 2017; Jin et al., 2017; Agrawal et al., 2018; Naderializadeh et al., 2020) . One can represent such data as (attributed) graphs, which are universal data structures. Efficient and generalizable learning from graph-structured data opens the door to a vast number of applications, which were beyond the reach of classic machine learning (ML) and, more specifically, deep learning (DL) algorithms. Analyzing graph-structured data has received significant attention from the ML, network science, and signal processing communities over the past few years. On the one hand, there has been a rush toward extending the success of deep neural networks to graph-structured data, which has led to a variety of graph neural network (GNN) architectures. On the other hand, the research on kernel approaches (Gärtner et al., 2003) , perhaps most notably the random walk kernel (Kashima et al., 2003) and the Weisfeiler-Lehman (WL) kernel (Shervashidze et al., 2011; Rieck et al., 2019; Morris et al., 2019; 2020) , remains an active field of study and the methods developed therein provide competitive performance in various graph representation tasks (see the recent survey by Kriege et al. (2020) ). To learn graph representations, GNN-based frameworks make use of three generic modules, which provide i) feature aggregation, ii) graph pooling (i.e., readout), and iii) classification (Hu* et al., 2020) . The feature aggregator provides a vector representation for each node of the graph, referred to as a node embedding. The graph pooling module creates a representation for the graph from its node embeddings, whose dimensionality is fixed regardless of the underlying graph size, and which can then be analyzed using a downstream classifier of choice. On the graph kernel side, one leverages a kernel to measure the similarities between pairs of graphs, and uses conventional kernel methods to perform learning on a set of graphs (Hofmann et al., 2008) . A recent example of such methods is the framework provided by Togninalli et al. (2019) , in which the authors propose a novel node embedding inspired by the WL kernel, and combine the resulting node embeddings with the Wasserstein distance (Villani, 2008; Kolouri et al., 2017) to measure the dissimilarity between two graphs. Afterwards, they leverage conventional kernel methods based on the pairwise-measured dissimilarities to perform learning on graphs. Considering the ever-increasing scale of graph datasets, which may contain tens of thousands of graphs or millions to billions of nodes per graph, the issue of scalability and algorithmic efficiency becomes of vital importance for graph learning methods (Hernandez & Brown, 2020; Hu et al., 2020) . However, both of the aforementioned paradigms of GNNs and kernel methods suffer in this sense. On the GNN side, acceleration of the training procedure is challenging and scales poorly as the graph size grows (Bojchevski et al., 2019) . On the graph kernel side, the need for calculating the matrix of all pairwise similarities can be a burden in datasets with a large number of graphs, especially if calculating the similarity between each pair of graphs is computationally expensive. For instance, in the method proposed in (Togninalli et al., 2019) , the computational complexity of each calculation of the Wasserstein distance is cubic in the number of nodes (or linearithmic for the entropy-regularized distance). To overcome these issues, inspired by the linear optimal transport framework of (Wang et al., 2013) , we propose a linear Wasserstein Embedding for Graph Learning, which we refer to as WEGL. Our proposed approach embeds a graph into a Hilbert space, where the 2 distance between two embedded graphs provides a true metric between the graphs that approximates their 2-Wasserstein distance. For a set of M graphs, the proposed method provides: 1. Reduced computational complexity of estimating the graph Wasserstein distance (Togninalli et al., 2019) for a dataset of M graphs from a quadratic complexity in the number of graphs, i.e., M (M -foot_0) 2 calculations, to linear complexity, i.e., M calculations of the Wasserstein distance; and 2. An explicit Hilbertian embedding for graphs, which is not restricted to kernel methods, and therefore can be used in conjunction with any downstream classification framework. We show that compared to multiple GNN and graph kernel baselines, WEGL achieves either stateof-the-art or competitive results on benchmark graph-level classification tasks, including classical graph classification datasets (Kersting et al., 2020) and the recent molecular property-prediction benchmarks (Hu et al., 2020) . We also compare the algorithmic efficiency of WEGL with two baseline GNN and graph kernel methods and demonstrate that it is much more computationally efficient relative to those algorithms.

2. BACKGROUND AND RELATED WORK

In this section, we provide a brief background on different methods for deriving representations for graphs and an overview on Wasserstein distances by reviewing the related work in the literature.

2.1. GRAPH REPRESENTATION METHODS

Let G = (V, E) denote a graph, comprising a set of nodes V and a set of edges E ⊆ V 2 , where two nodes u, v ∈ V are connected to each other if and only if (u, v) ∈ E. 1 For each node v ∈ V, we define its set of neighbors as N v {u ∈ V : (u, v) ∈ E}. The nodes of the graph G may have categorical labels and/or continuous attribute vectors. We use a unified notation of x v ∈ R F to denote the label and/or attribute vector of node v ∈ V, where F denotes the node feature dimensionality. Moreover, we use w uv ∈ R E to denote the edge feature vector for any edge (u, v) ∈ E, where E denotes the edge feature dimensionality. Node and edge features may be present depending on the graph dataset under consideration. To learn graph properties from the graph structure and its node/edge features, one can use a function ψ : G → H to map any graph G in the space of all possible graphs G to an embedding ψ(G) in a Hilbert space H. Kernel methods have been among the most popular ways of creating such graph embeddings. A graph kernel is defined as a function k : G 2 → R, where for two graphs G and G , k(G, G ) represents the inner product of the embeddings ψ(G) and ψ(G ) over the Hilbert space H. The mapping ψ could be explicit, as in graph convolutional neural networks, or implicit as in the case of the kernel similarity function k(•, •) (i.e., the kernel trick). Kriege et al. (2014) provide a thorough discussion on explicit and implicit embeddings for learning from graphs. Kashima et al. (2003) introduced graph kernels based on random walks on labeled graphs. Subsequently, shortest-path kernels were introduced in (Borgwardt & Kriegel, 2005) . These works have been followed by graphlet and Weisfeiler-Lehman subtree kernel methods (Shervashidze et al., 2009; 2011; Morris et al., 2017) . More recently, kernel methods using spectral approaches (Kondor & Pan, 2016) , assignment-based approaches (Kriege et al., 2016; Nikolentzos et al., 2017) , and graph decomposition algorithms (Nikolentzos et al., 2018) have also been proposed in the literature. Despite being successful for many years, kernel methods often fail to leverage the explicit continuous features that are provided for the graph nodes and/or edges, making them less adaptable to the underlying data distribution. To alleviate these issues, and thanks in part to the prominent success of deep learning in many domains, including computer vision and natural language processing, techniques based on graph neural networks (GNNs) have emerged as an alternative paradigm for learning representations from graph-based data. In general, a GNN comprises multiple hidden layers, where at each layer, each node combines the features of its neighboring nodes in the graph to derive a new feature vector. At the GNN output, the feature vectors of all nodes are aggregated using a readout function (such as global average pooling), resulting in the final graph embedding ψ(G). More details on the combining and readout mechanisms of GNNs are provided in Appendix A.4. Kipf and Welling (Kipf & Welling, 2016) proposed a GNN architecture based on a graph convolutional network (GCN) framework. This work, alongside other notable works on geometric deep learning (Defferrard et al., 2016) , initiated a surge of interest in GNN architectures, which has led to several architectures, including the Graph Attention network (GAT) (Veličković et al., 2017) , Graph SAmple and aggreGatE (GraphSAGE) (Hamilton et al., 2017) , and the Graph Isomorphism Network (GIN) (Xu et al., 2019) . Each of these architectures modifies the GNN combining and readout functions and demonstrates state-of-the-art performance in a variety of graph representation learning tasks.

2.2. WASSERSTEIN DISTANCES

Let µ i denote a Borel probability measure with finite p th moment defined on Z ⊆ R d , with corresponding probability density function p i , i.e., dµ i (z) = p i (z)dz. The 2-Wasserstein distance between µ i and µ j defined on Z, Z ⊆ R d is the solution to the optimal mass transportation problem with 2 transport cost (Villani, 2008) : W 2 (µ i , µ j ) = inf γ∈Γ(µi,µj ) Z×Z z -z 2 dγ(z, z ) 1 2 , where Γ(µ i , µ j ) is the set of all transportation plans γ ∈ Γ(µ i , µ j ) such that γ(A × Z ) = µ i (A) and γ(Z × B) = µ j (B) for any Borel subsets A ⊆ Z and B ⊆ Z . Due to Brenier's theorem (Brenier, 1991) , for absolutely continuous probability measures µ i and µ j (with respect to the Lebesgue measure), the 2-Wasserstein distance can be equivalently obtained from W 2 (µ i , µ j ) = inf f ∈M P (µi,µj ) Z z -f (z) 2 dµ i (z) 1 2 , ( ) where M P (µ i , µ j ) = {f : Z → Z | f # µ i = µ j } and f # µ i represents the pushforward of measure µ i , characterized as B dµ j (z ) = f -1 (B) dµ i (z) for any Borel subset B ⊆ Z . (3) The mapping f is referred to as a transport map (Kolouri et al., 2017) , and the optimal transport map is called the Monge map. For absolutely continuous measures, the differential form of the above equation takes the following form, det(Df (z))p j (f (z)) = p i (z), which is referred to as the Jacobian equation. For discrete probability measures, when the transport plan γ is a deterministic optimal coupling, such a transport plan is referred to as a Monge coupling (Villani, 2008) . Recently, Wasserstein distances have been used for representation learning on graphs and images (Togninalli et al., 2019; Zhang et al., 2020; Bécigneul et al., 2020) . In particular, (Togninalli et al., 2019) proposed a Wasserstein kernel for graphs that involves pairwise calculation of the Wasserstein distance between graph representations. Pairwise calculation of the Wasserstein distance, however, could be expensive, especially for large graph datasets. In what follows, we apply the linear optimal transportation framework (Wang et al., 2013) to define a Hilbertian embedding, in which the 2 distance provides a true metric between the probability measures that approximates W 2 . We show that in a dataset containing M graphs, this framework reduces the computational complexity from calculating M (M -1) 2 linear programs to M . Wang et al. ( 2013) and the follow-up works (Seguy & Cuturi, 2015; Kolouri et al., 2016; Courty et al., 2018) describe frameworks for isometric Hilbertian embedding of probability measures such that the Euclidean distance between the embedded images approximates W 2 . We leverage the prior work and introduce the concept of linear Wasserstein embedding for learning graph embeddings.

3.1. THEORETICAL FOUNDATION

We adhere to the definition of the linear Wasserstein embedding for continuous measures. However, all derivations hold for discrete measures as well. More precisely, let µ 0 be a reference probability measure defined on Z ⊆ R d , with a positive probability density function p 0 , s.t. dµ 0 (z) = p 0 (z)dz and p 0 (z) > 0 for ∀z ∈ Z. Let f i denote the Monge map that pushes µ 0 into µ i , i.e., f i = argmin f ∈M P (µ0,µi) Z z -f (z) 2 dµ 0 (z). (4) Define φ(µ i ) (f i -id) √ p 0 , where id(z) = z, ∀z ∈ Z is the identity function. In cartography, such a mapping is known as the equidistant azimuthal projection, while in differential geometry, it is called the logarithmic map. The mapping φ(•) has the following characteristics (partially illustrated in Figure 1 ): 1. φ(•) provides an isometric embedding for probability measures, i.e., using the Jacobian equation p i = det(Df -1 i )p 0 (f -1 i ), where f i = φ(µi) √ p0 + id. 2. φ(µ 0 ) = 0, i.e., the reference is mapped to zero.

3.. φ(µ

i ) -φ(µ 0 ) 2 = φ(µ i ) 2 = W 2 (µ i , µ 0 ), i.e. , the mapping preserves distances to µ 0 .

4.. φ(µ

i ) -φ(µ j ) 2 ≈ W 2 (µ i , µ j ), i.e. , the 2 distance between φ(µ i ) and φ(µ j ), while being a true metric between µ i and µ j , is an approximation of W 2 (µ i , µ j ). Embedding probability measures {µ i } M i=1 via φ(•) requires calculating M Monge maps. The fourth characteristic above states that φ(•) provides a linear embedding for the probability measures. Therefore, we call it the linear Wasserstein embedding. The mapping φ(µ i ) could be thought as the Reproducing Kernel Hilbert Space (RKHS) embedding of the measure, µ i (Muandet et al., 2017) . In practice, for discrete distributions, the Monge coupling is used, which could be approximated from the Kantorovich plan (i.e., the transport plan) via the so-called barycenteric projection (Wang et al., 2013) . We here acknowledge the concurrent work by Mialon et al. (2021) , where the authors use a similar idea to the linear Wasserstein embedding as a pooling operator for learning from sets of features. The authors demonstrate the relationship between their proposed optimal transport-based pooling operation and the widespread attention pooling methods in the literature. A detailed description of the capabilities of the linear Wasserstein embedding framework is included in Appendix A.1. We next provide the numerical details of the barycenteric projection (Ambrosio et al., 2008; Wang et al., 2013) .

3.2. NUMERICAL DETAILS

Consider a set of probability distributions {p i } M i=1 , and let Z i = z i 1 , . . . , z i Ni T ∈ R Ni×d be an array containing N i i.i.d. samples from distribution p i , i.e., z i k ∈ R d ∼ p i , ∀k ∈ {1, . . . , N i }. Let us define p 0 to be a reference distribution, with Z 0 = z 0 1 , . . . , z 0 N T ∈ R N ×d , where N = 1 M M i=1 N i and z 0 j ∈ R d ∼ p 0 , ∀j ∈ {1, . . . , N }. The optimal transport plan between p i and p 0 , denoted by π * i ∈ R N ×Ni , is the solution to the following linear program, π * i = argmin π∈Πi N j=1 Ni k=1 π jk z 0 j -z i k 2 , where Π i π ∈ R N ×Ni N i N j=1 π jk = N Ni k=1 π jk = 1, ∀k ∈ {1, . . . , N i }, ∀j ∈ {1, . . . , N } . The Monge map is then approximated from the optimal transport plan by barycentric projection via F i = N (π * i Z i ) ∈ R N ×d . ( ) Note that the transport plan π i could split the mass in z 0 j ∈ Z 0 and distribute it on z i k s. The barycentric projection calculates the center of mass of the transportation locations for z 0 j to ensure that no mass splitting is happening (see Figure 5 ), and hence it approximates a Monge coupling. Finally, the embedding can be calculated by φ (Z i ) = (F i -Z 0 )/ √ N ∈ R N ×d . With a slight abuse of notation, we use φ(p i ) and φ(Z i ) interchangeably throughout the paper. Due to the barycenteric projection, here, φ(•) is only pseudo-invertible.

4. WEGL: A LINEAR WASSERSTEIN EMBEDDING FOR GRAPHS

The application of the optimal transport problem to graphs is multifaceted. For instance, some works focus on solving the "structured" optimal transport concerning an optimal probability flow, where the transport cost comes from distances on an often unchanging underlying graph (Léonard et al., 2016; Essid & Solomon, 2018; Titouan et al., 2019) . Here, we are interested in applying optimal transport to measure the dissimilarity between two graphs (Maretic et al., 2019; Togninalli et al., 2019; Dong & Sawin, 2020) . Our work significantly differs from (Maretic et al., 2019; Dong & Sawin, 2020) , which measure the dissimilarity between non-attributed graphs based on distributions defined by their Laplacian spectra and is closer to (Togninalli et al., 2019) . Our proposed graph embedding framework, termed Wasserstein Embedding for Graph Learning (WEGL), combines node embedding methods for graphs with the linear Wasserstein embedding explained in Section 3. More precisely, let {G i = (V i , E i )} M i=1 denote a set of M individual graphs, each with a set of possible node features {x v } v∈Vi and a set of possible edge features {w uv } (u,v)∈Ei .

Let h(•) be an arbitrary node embedding process, where h(G

i ) = Z i = z 1 , . . . , z |Vi| T ∈ R |Vi|×d . Having the node embeddings {Z i } M i=1 , we can then calculate a reference node embedding Z 0 (see Section 4.2 for details), which leads to the linear Wasserstein embedding φ(Z i ) with respect to Z 0 , as described in Section 3. Therefore, the entire embedding for each graph  G i , i ∈ {1, . . . , M }, is obtained by composing φ(•) and h(•), i.e., ψ(G i ) = φ(h(G i )).

4.1. NODE EMBEDDING

There are many choices for node embedding methods (Chami et al., 2020) . These methods in general could be parametric or non-parametric, e.g., as in propagation/diffusion-based embeddings. Parametric embeddings are often implemented via a GNN encoder. The encoder can capture different graph properties depending on the type of supervision (e.g., supervised or unsupervised). Selfsupervised embedding methods have also been recently shown to be promising (Hu* et al., 2020) . In this paper, for our node embedding process h(•), we follow a similar non-parametric propagation/diffusion-based encoder as in (Togninalli et al., 2019) . One of the appealing advantages of this framework is its simplicity, as there are no trainable parameters involved. In short, given a graph G = (V, E) with node features {x v } v∈V and scalar edge features {w uv } (u,v)∈E , we use the following instantiation of equation 12 to define the combining function as x (l) v = u∈Nv∪{v} w uv deg(u)deg(v) • x (l-1) u , ∀l ∈ {1, . . . , L}, ∀v ∈ V, where for any node v ∈ V, its degree deg(v) is defined as its number of neighbors in G augmented with self-connections, i.e., deg(v) 1 + |N v |. Note that the normalization of the messages between graph nodes by the (square root of) the two end-point degrees in equation 7 have also been used in other architectures, including GCN (Kipf & Welling, 2016) . For the cases where the edge weights are not available, including self-connection weights {w vv } v∈V , we set them to one. In Appendix A.5, we show how we use an extension of equation 7 to treat graphs with multiple edge features/labels. Finally, we let z v = g {x (l) v } L l=0 represent the resultant embedding for each node v ∈ V, where g(•) is a local pooling process on a single node (not a global pooling), e.g., concatenation or averaging.

4.2. CALCULATION OF THE REFERENCE DISTRIBUTION

To calculate the reference distribution, we use the k-means clustering algorithm on M i=1 Z i with N = 1 M M i=1 N i centroids. Alternatively, one can calculate the Wasserstein barycenter (Cuturi & Doucet, 2014) of the node embeddings or simply use N samples from a normal distribution. While approximation of the 2-Wasserstein distance in the tangent space depends on the reference distribution choice, surprisingly, we see a stable performance of WEGL for different references. In Appendix A.2, we compare the performance of WEGL with respect to different references.

5. EXPERIMENTAL EVALUATION

In this section, we discuss the evaluation results of our proposed algorithm on multiple benchmark graph classification datasets. We use the PyTorch Geometric framework (Fey & Lenssen, 2019) for implementing WEGL. In all experiments, we use scikit-learn for the implementation of our downstream classifiers on the embedded graphs (Buitinck et al., 2013) .

5.1. MOLECULAR PROPERTY PREDICTION ON THE OPEN GRAPH BENCHMARK

We first evaluate our algorithm on the molecular property prediction task on the ogbg-molhiv dataset. This dataset is part of the Open Graph Benchmark (Hu et al., 2020) , which involves node-level, link-level, and graph-level learning and prediction tasks on multiple datasets spanning diverse problem domains. The ogbg-molhiv dataset, in particular, is a molecular tree-like dataset, consisting of 41, 127 graphs, with an average number of 25.5 nodes and 27.5 edges per graph. Each graph is a molecule, with nodes representing atoms and edges representing bonds between them, and

Method

Validation ROC-AUC (%) Test ROC-AUC (%) GCN (Kipf & Welling, 2016) 83.8 ± 0.9 76.0 ± 1.2 GIN + Virtual Node (Xu et al., 2019) 84.8 ± 0.7 77.1 ± 1.5 DeeperGCN (Li et al., 2020) 84.3 ± 0.6 78.6 ± 1.2 HIMP (Fey et al., 2020) -78.8 ± 0.8 GNN GCN + GraphNorm (Cai et al., 2020) 79.0 ± it includes both node and edge attributes, characterizing the atom and bond features. The goal is to predict a binary label indicating whether or not a molecule inhibits HIV replication. To train and evaluate our proposed method, we use the scaffold split provided by the dataset, and report the mean and standard deviation of the results across 10 different random seeds. We perform a grid search over a set of hyperparameters and report the configuration that leads to the best validation performance. We also report a virtual node variant of the graphs in our evaluations, where each graph is augmented with an additional node that is connected to all the original nodes in the graph. This node serves as a shortcut for message passing among the graph nodes, bringing any pair of nodes within at most two hops of each other. The complete implementation details can be found in Appendix A.5. Table 1 shows the evaluation results of WEGL on the ogbg-molhiv dataset in terms of the ROC-AUC (i.e., Receiver Operating Characteristic Area Under the Curve), alongside multiple GNNbased baseline algorithms. Specifically, we show the results using two classifiers: A random forest classifier (Breiman, 2001) and an automated machine learning (AutoML) classifier using the Auto-Sklearn 2.0 library (Feurer et al., 2020) . As the table demonstrates, while WEGL embeddings combined with random forest achieve a decent performance level, using AutoML further enhances the performance and achieves state-of-the-art test results on this dataset, showing the high expressive power of WEGL in large-scale graph datasets, without the need for end-to-end training. Moreover, as an ablation study, we report the evaluation results on the ogbg-molhiv dataset using a random forest classifier, where the Wasserstein embedding module after the node embedding process is replaced with global average pooling (GAP) among the output node embeddings of each graph to derive the graph-level embedding. As the table demonstrates, there is a significant performance drop when using GAP graph embedding, which indicates the benefit of our proposed graph embedding method as opposed to average readout.

5.2. TUD BENCHMARK DATASETS

We also consider a set of social network, bioinformatics and molecule graph datasets (Kersting et al., 2020) . The social network datasets (IMDB-BINARY, IMDB-MULTI, COLLAB, REDDIT-BINARY, and REDDIT-MULTI-5K) lack both node and edge features. Therefore, in these datasets we use a one-hot representation of the node degrees as their initial feature vectors, as also used in prior work, e.g., (Xu et al., 2019) . To handle the large scale of the REDDIT-BINARY and REDDIT-MULTI-5K and datasets, we clip the node degrees at 500. Moreover, for the molecule (PTC-MR) and bioinformatics (ENZYMES and PROTEINS) datasets, we use the readily-provided node labels in (Kersting et al., 2020) as the initial node feature vectors. Besides, for PTC-MR which has edge labels, as explained in Appendix A.5, we use an extension of equation 7 to use the one-hot encoded edge features in the diffusion process. To evaluate the performance of WEGL, we follow the methodology used in (Yanardag & Vishwanathan, 2015; Niepert et al., 2016; Xu et al., 2019) , where for each dataset, we perform 10-fold cross-validation with random splitting on the entire dataset, conducting a grid search over the desired set of hyperparameters as mentioned in Appendix A.5, and we then report the mean and standard deviation of the validation accuracies achieved during cross-validation. For this experiment, we use two ensemble classifiers, Method IMDB-B IMDB-M COLLAB RE-B RE-M5K PTC-MR ENZYMES PROTEINS DGCNN (Zhang et al., 2018a) 69.2±3.0 45.6±3.4 71.2±1.9 87.8±2.5 49.2±1.2 58.6 38.9±5.7 72.9±3.5 GraphSAGE (Hamilton et al., 2017) 68.8±4.5 47.6±3.5 73.9±1.7 84.3±1.9 50.0±1.3 63.9±7.7 -75.9±3.2 GIN (Xu et al., 2019) 75 namely Random Forest and Gradient Boosted Decision Tree (GBDT), together with kernel-SVM with an RBF kernel (SVM-RBF). Given that the Euclidean distance in the embedding space approximates the 2-Wasserstein distance, the SVM-RBF classification results are comparable with those reported by Togninalli et al. (2019) . Table 2 shows the classification accuracies achieved by WEGL on the aforementioned datasets as compared with several GNN and graph kernel baselines, whose results are extracted from the corresponding original papers. As the table demonstrates, our proposed algorithm achieves either state-of-the-art or competitive results across all the datasets, and in particular, it is among the top-three performers in all of them. This shows the effectiveness of the proposed linear Wasserstein embedding for learning graph-level properties across different domains.

5.3. COMPUTATION TIME

As mentioned before, one of the most important advantages of WEGL as compared to other graph representation learning methods is its algorithmic efficiency. To evaluate that, we compare the wall-clock training and inference times of WEGL with those of GIN and the Wasserstein Weisfeiler-Lehman (WWL) graph kernel on five different TUD datasets (IMDB-B, MUTAG, PTC-MR, PROTEINS, and NCI1)foot_1 . For WEGL and WWL, we use the exact linear programming solver (as opposed to the entropy-regularized version). We carry out our experiments for WEGL and WWL on a 2.3 GHz Intel R Xeon R E5-2670 v3 CPU, while we use a 16 GB NVIDIA R Tesla R P100 GPU for GIN. As a reference, we implement GIN on the aforementioned CPU hardware as well. Figure 3 shows how the training and inference run-times of WEGL, WWL and GIN scale with the number of graphs in the dataset, the average number of nodes per graph, and the average number of edges per graph. As the figure illustrates, while having similar or even better performance, training WEGL is several orders of magnitude faster than WWL and GIN, especially for datasets with larger numbers of graphs. Note that training WWL becomes very inefficient as the number of graphs in the dataset increases due to the pairwise distance calculation between all the graphs. During inference, WEGL is slightly slower than GIN, when implemented on a GPU. However, on most datasets, WEGL is considerably faster in inference as compared to GIN implemented on a CPU. Moreover, both algorithms are significantly faster than WWL. Using GPU-accelerated implementations of the diffusion process in equation 7 and the entropy-regularized transport problem could potentially further enhance the computational efficiency of WEGL during inference. (Togninalli et al., 2019) and GIN (Xu et al., 2019) . WEGL and WWL were implemented on a 2.3 GHz Intel R Xeon R E5-2670 v3 CPU. Moreover, GIN was implemented twice, once using the aforementioned CPU and once using a 16 GB NVIDIA R Tesla R P100 GPU, with 5 hidden layers, 300 training epochs and a batch size of 16 in both cases.

6. CONCLUSION

We considered the problem of graph property prediction and introduced the linear Wasserstein Embedding for Graph Learning, which we denoted as WEGL. Similar to (Togninalli et al., 2019) , our approach also relies on measuring the Wasserstein distances between the node embeddings of graphs. Unlike (Togninalli et al., 2019) , however, we further embed the node embeddings of graphs into a Hilbert space, in which their Euclidean distance approximates their 2-Wasserstein distance. WEGL provides two significant benefits: i) it has linear complexity in the number of graphs (as opposed to the quadratic complexity of (Togninalli et al., 2019) ), and ii) it enables the application of any ML algorithm of choice, such as random forest, gradient boosted decision tree, or even AutoML. We demonstrated WEGL's superior performance and highly efficient training on a wide range of benchmark datasets, including the ogbg-molhiv dataset and the TUD graph classification tasks.

A APPENDIX

Here we provide further details on the theoretical aspect of WEGL, our implementation details, and the sensitivity of the results to the choice of reference distribution. We also share our implementation code on the ogbg-molhiv dataset to help with the review process.

A.1 DETAILED DISCUSSION ON LINEAR WASSERSTEIN EMBEDDING

The linear Wasserstein embedding used in WEGL is based on the Linear Optimal Transport (LOT) framework introduced in (Wang et al., 2013) . The main idea is to compute the "projection" of the manifold of probability measures to the tangent space at a fixed reference measure. In particular, the tangent space at measure µ 0 is the set of vector fields V µ0 = {v : Z → R d | Z |v(z)| 2 dµ 0 (z) < ∞} such that the inner product is the weighted 2 : v i , v j µ0 = Z v i (z) • v j (z)dµ 0 (z). ( ) We can then define v i (•) f (•) -id(•) , where f (•) is the optimal transport map from µ 0 to µ i . Note that v i ∈ V µ0 , v 0 = 0, and v i -v 0 2 µ0 = v i 2 µ0 = v i , v i µ0 = Z f (z) -z 2 dµ 0 (z) = W 2 (µ i , µ 0 ). In the paper, we use φ(µ i ) = v i √ p 0 to turn the weighted-2 into 2 . The discussion above assumes an absolutely continuous reference measure µ 0 . A more interesting treatment of the problem is via the generalized geodesics defined in (Ambrosio et al., 2008) , connecting µ i and µ j and enabling us to use discrete reference measures. Following the notation in (Ambrosio et al., 2008) , given the reference measure µ 0 , let Γ(µ i , µ 0 ) be the set of transport plans between µ i and µ 0 , and let Γ(µ i , µ j , µ 0 ) be the set of all measures on the product space Z × Z × Z such that the marginals over µ i and µ j are Γ(µ j , µ 0 ) and Γ(µ i , µ 0 ), respectively. Then the linearized optimal transport distance is defined as δ z l , we have d 2 LOT,µ0 (µ i , µ j ) = inf γ∈Γ(µi,µj ,µ0) Z×Z×Z z -z 2 dγ(z, z , z ). d 2 LOT,µ0 (µ i , µ j ) = min γ∈Γ(µi,µj ,µ0) 1 N i N j N Ni n=1 Nj m=1 N l=1 γ nml z n -z m 2 . See Figure 4a for a depiction of Equation equation 11's meaning. Finally, the idea of barycenteric projection used to approximate Monge couplings and provide a fixed-size representation is shown in Figure 4b . , where we only observe samples Zi = [z i k ∼ pi] N i i=1 from each distribution, together with the process of obtaining the linear Wasserstein embedding with respect to a reference distribution. In short, for each distribution pi, the embedding approximates the Monge-map (i.e., a vector field) from the reference samples Z0 to the target samples Zi by a barycentric projection of the optimal transport plan. Adding samples in the embedding space corresponds to adding their vector fields, which can be used to calculate (b) the mean distribution in the embedding space, i.e., φ -1 ( 1 M M i=1 φ(pi)) and (c)-(d) the Euclidean geodesics in this space, i.e., φ -1 (αφ(pi) + (1 -α)φ(pj)) for α ∈ [0, 1]. As can be seen, the calculated mean is the Wasserstein barycenter of the dataset, and the Euclidean geodesics in the embedding space follow the Wasserstein geodesics in the original space. ! "#$ % & % ' % ( % ( ) % ' ) % & ) % $ )) % & % ' % ( % ( ) % ' ) % & ) % $ )) %̅ $ %̅ $ ) + , + - + . + , + - + . Reference Reference / , $ [/ -] $ [/ -] $ -[/ , ] $ Next, to demonstrate the capability of the linear Wasserstein embedding, we present the following experiment. Consider a set of distributions {p i } M i=1 , where each p i is a translated and dilated ring distribution in R 2 , and N i samples are observed from p i , where N i and N j could be different for i = j. We then consider a normal distribution as the reference distribution and calculate the linear Wasserstein embedding with respect to the reference (See Figure 5a ). Given the pseudo-invertible nature of the embedding, to demonstrate the modeling capability of the framework, we calculate the mean in the embedding space (i.e., on the vector fields), and invert it to obtain the mean distribution p. Figure 5b shows the calculated mean, indicating that the linear Wasserstein embedding framework has successfully retrieved a ring distribution as the mean. Finally, we calculate Euclidean geodesics in the embedding space (i.e., the convex combination of the vector fields) between p i and p 0 , as well as between p i and p j , and show the inverted geodesics in Figures 5c and 5d , respectively. As the figures demonstrate, the calculated geodesics follow the Wasserstein geodesics.

APPROXIMATION ERROR OF THE EMBEDDING

Given the Euclidean distance in the embedding space is a transport-based distance (i.e., the socalled LOT distance) that approximates the Wasserstein distance, a natural question arises about the approximation error. Here we point the reader to the recent work of Moosmüller & Cloninger (2020) in which the authors provide bounds on how well the Euclidean distance in the embedding space approximates the 2-Wasserstein distance. In particular, the authors show that: W 2 (µ i , µ j ) ≤ φ(µ i ) -φ(µ j ) 2 ≤ W 2 (µ i , µ j ) + f µj µi -f µj µ0 • f µ0 µi µi , where f µj µi is the optimal transport map from µ i to µ j . This inequality simply indicates that the approximation error is caused by conditioning the transport map to be obtained by composition of the optimal transport maps from µ i to µ 0 , and then from µ 0 to µ j . More importantly, it can be shown that if µ i and µ j are shifted and scaled versions of the reference measure, µ 0 , then the embedding is isometric (See Figure 1 and 2 in Moosmüller & Cloninger (2020) ). Maybe a less interesting upper bound can also be obtained by the triangle inequality: W 2 (µ i , µ j ) ≤ φ(µ i ) -φ(µ j ) 2 ≤ W 2 (µ i , σ) + W 2 (σ, µ j ), which ensures some regularity of the embedding.

REGULARITY OF THE EMBEDDING

A good question was raised during the feedback period, about the regularity of the proposed embedding. The regularity of the graph embedding will depend on both the regularity of node-embedding and the Wasserstein embedding. In the following, we avoid the discussion on regularity of the nodeembedding (as it is not the main focus of our work), and focus on the regularity of the Wasserstein embedding. To that end, we first point out several regularity characteristics pointed out in Appendix A of Moosmüller & Cloninger (2020) . Most notably Theorem 4.2 in their paper, shows an almost isometric property when the distortions are within an -tube around the set of shifts and scalings. In short, let µ ∈ P 2 (Z), R > 0, > 0, E be the set of all shifts and scalings, and E µ,R = {h ∈ E : h µ ≤ R}, and let G µ,R, = {g ∈ L 2 (Z, µ) : ∃h ∈ E µ,R : g -h µ ≤ } , which is the -tube around set of shifts and scalings. Now, assume µ 0 ∈ P 2 (Z) is the reference measure and both µ and µ 0 satisfy Caffarelli's regularity theorem. Then for g 1 , g 2 ∈ G µ,R, we have 0 ≤ φ(g 1# µ) -φ(g 2# µ) 2 -W 2 (g 1# µ, g 2# µ) ≤ C µ0,µ,R + Cµ0,µ,R 2 , where C µ0,µ,R and Cµ0,µ,R are constants depending on µ 0 , µ, and R. The results shown above can be used to derive regularity results for the linear Wasserstein embedding with respect to additive noise. We know that the addition of two random variables leads to a new random variable with its PDF being the convolution of the original PDFs. Therefore, for features Z i = [z 1 , ..., z Ni ] T , let Ẑi = [z 1 + e 1 ẑ1 , ..., z Ni + e Ni ẑN i ] T denote the noisy samples for e i ∼ η, where η is the noise distribution. Then the noisy samples, ẑi , will be distributed according to μi = µ i * η. For instance, for the Gaussian additive noise, μi is the smoothed version of µ i . Therefore, there exists a transport map in g ∈ G µi,R, for which, μi = g # µ i and g -id µi = W 2 (μ i , µ i ) ≤ , and we have: 0 ≤ φ(µ i ) -φ(μ i ) 2 ≤ (C µ0,µi,R + 1) + Cµ0,µi,R 2 .

A.2 SENSITIVITY TO THE CHOICE OF REFERENCE DISTRIBUTION

To measure the dependency of WEGL on the reference distribution choice, we changed the reference to a normal distribution (i.e., data-independent). We compared the results of WEGL using the new reference distribution to that using a reference distribution calculated via k-means on the training set. We used the ogbg-molhiv dataset with initial node embedding of size 300 and 4 diffusion layers. We ran the experiment with 100 different random seeds, and measured the test ROC-AUC of WEGL calculated with the two aforementioned reference distributions. Figure 6 shows the results of this experiment, indicating that the choice of reference distribution is statistically insignificant. A.3 LINEAR PROGRAMMING VS. ENTROPY REGULARIZATION In this paper, we used the Python Optimal Transport (Flamary & Courty, 2017) for the calculation of the optimal transport plans. During the feedback period, a point came up regarding the entropyregularized version of the OT problem (Cuturi & Doucet, 2014) , which reduces the complexity of the linear programming problem from being cubic, in the number of nodes, to being quadratic, using the Sinkhorn algorithm. Given the Sinkhorn algorithm's iterative nature, the computational gain of the method is prominent when calculating the transportation problem between graphs with a large number of nodes (e.g., larger than 10 3 ). However, the graph datasets used in this paper often have a small number of nodes (e.g., < 50). In these settings, linear programming is efficient. To obtain any computational gain using the Sinkhorn algorithm, one would need to use a large regularization coefficient, which reduces the Sinkhorn algorithm's precision. Here we ran an experiment on the ogbg-molhiv dataset. We obtain the transport plans between the node embeddings and the reference distribution using the linear programming solver (using ot.emd2 from (Flamary & Courty, 2017) ) and the Sinkhorn algorithm for the entropy-regularized problem (using ot.sinkhorn2 from (Flamary & Courty, 2017) ). We measure the calculation time as well as the calculated distances for both algorithms. For the Sinkhorn algorithm, we used four different regularization values. We report the mean and standard deviation of calculation time ratio, i.e., 7 . As the figure shows, due to the small graph size, the LP solver is efficient and little to no gain can be obtained using the Sinkhorn algorithm. Nevertheless, in the case of dealing with larger graph sizes, the entropy regularized formulation should be the definite choice. Finally, for the entropy-regularized OT problem, more efficient solvers have been proposed that outperform the Sinkhorn algorithm (Dvurechensky et al., 2018) . However, given the acceptable performance of the linear programming solver (at least for the graph datasets in this paper), we did not find it necessary to seek more efficient solvers.

A.4 INNER WORKING OF GNNS

In its most general form, a GNN consists of L hidden layers, where at the l th layer, each node v ∈ V aggregates and combines messages from its 1-hop neighboring nodes N v , resulting in the feature vector x (l) v = Ψ combine x (l-1) v , (x (l-1) u , w uv ) u∈Nv , ∀l ∈ {1, . . . , L}, ∀v ∈ V, where Ψ combine (•) denotes a parametrized and differentiable combining function. At the input layer, each node v ∈ V starts with its initial feature vector  x 0 v = x v ∈ R F , ψ(G) = Ψ readout x (l) v v∈V,l∈{0,...,L} .

A.5 IMPLEMENTATION DETAILS

To derive the node embeddings, we use the diffusion process in equation 7 for the datasets without edge features/labels, i.e., all the social network datasets (IMDB-BINARY, IMDB-MULTI, COLLAB, REDDIT-BINARY, REDDIT-MULTI-5K, and REDDIT-MULTI-12K) and four of the molecule and bioinformatics datasets (ENZYMES, PROTEINS, D&D, and NCI1). We specifically set w uv = 1 for any (u, v) ∈ E and also for all self-connections, i.e., w vv = 1, ∀v ∈ V. The remaining datasets contain edge labels that cannot be directly used with equation 7. Specifically, each edge in the MUTAG and PTC-MR datasets has a categorical label, encoded as a one-hot vector of dimension four. Moreover, in the ogbg-molhiv dataset, each edge has three categorical indicating bond type (five categories), bond stereochemistry (six categories) and whether the bond is conjugated (two categories). We first convert each categorical feature to its one-hot representation, and then concatenate them together, resulting in a binary 13-dimensional feature vector for each edge. In each of the three aforementioned datasets, for any edge (u, v) ∈ E, let us denote its binary feature vector by w uv ∈ {0, 1} E , where E is equal to 4, 4, and 13 for MUTAG, PTC-MR, and ogbg-molhiv, respectively. We then use the following extension of the diffusion process in equation 7, x (l) v = u∈V E e=1 w uv,e deg e (u)deg e (v) x (l-1) u , ∀l ∈ {1, . . . , L}, ∀v ∈ V, where for any e ∈ {1, . . . , E}, w uv,e denotes the e th element of w uv , and for any node v ∈ V, we define deg e (v) as its degree over the e th elements of the edge features; i.e., deg e (v) u∈V w uv,e . We assign vectors of all-one features to the self-connections in the graph; i.e., w vv,e = 1, ∀v ∈ V, ∀e ∈ {1, . . . , E}. Note that the formulation of the diffusion process in equation 14 can be seen as an extension of equation 7, where the underlying graph with multi-dimensional edge features is broken into E parallel graphs with non-negative single-dimensional edge features, and the parallel graphs perform message passing at each round/layer of the diffusion process. For the ogbg-molhiv experiments in which virtual nodes were appended to the original molecule graphs, we set the initial feature vectors of all virtual nodes to all-zero vectors. Moreover, for any graph G i in the dataset with |V i | nodes, we set the edge features for the edge between the virtual node v virtual and each of the original graph nodes u ∈ V i as w uv virtual ,e = 1 |Vi| , ∀e ∈ {1, . . . , E}. The normalization by the number of graph nodes is included so as to regulate the degree of the virtual node used in equation 14. We also include the resultant embedding of the virtual node at the end of the diffusion process in the calculation of the graph embedding ψ(G i ). In the experiments conducted on each dataset, once the node embeddings are derived from the diffusion process, we standardize them by subtracting the mean embedding and dividing by the standard deviation of the embeddings, where the statistics are calculated based on all the graphs in the dataset. Moreover, to reduce the computational complexity of estimating the graph embeddings for the ogbg-molhiv dataset, we further apply a 20-dimensional PCA on the node embeddings. 



Note that this definition includes both directed and undirected graphs, where in the latter case, for each edge (u, v) ∈ E, the reverse edge (v, u) is also included in E. The complete classification results of WEGL on these datasets can be found in Appendix A.6.



Figure 1: Graphical representation of the linear Wasserstein embedding framework, where the probability distributions are mapped to the tangent space with respect to a fixed reference distribution. The figure is adapted from Kolouri et al. (2017).

Figure 2 visualizes this process.

Figure 2: Our proposed graph embedding framework, WEGL, combines node embedding methods with the linear Wasserstein embedding framework described in Section 3. Given a graph Gi = (Vi, Ei), we first embed the graph nodes into a d-dimensional Hilbert space and obtain an array of node embeddings, denoted by h(Gi) = Zi ∈ R |V i |×d . We then calculate the linear Wasserstein embedding of Zi with respect to a reference Z0, i.e., φ(Zi), to derive the final graph embedding.

Figure3: Average wall-clock time comparison of our proposed method, WEGL with WWL(Togninalli et al., 2019) and GIN(Xu et al., 2019). WEGL and WWL were implemented on a 2.3 GHz Intel R Xeon R E5-2670 v3 CPU. Moreover, GIN was implemented twice, once using the aforementioned CPU and once using a 16 GB NVIDIA R Tesla R P100 GPU, with 5 hidden layers, 300 training epochs and a batch size of 16 in both cases.

Figure 4: Illustration of (a) the meaning behind γ nml used in the LOT distance in equation 11, and (b) the idea of the barycenteric projection, which provides a fixed-size representation (i.e., of size N ).

Figure 5: An experiment demonstrating the capability of the linear Wasserstein embedding. (a) A simple dataset consisting of shifted and scaled noisy ring distributions {pi} M i=1, where we only observe samples Zi = [z i k ∼ pi] N i i=1 from each distribution, together with the process of obtaining the linear Wasserstein embedding with respect to a reference distribution. In short, for each distribution pi, the embedding approximates the Monge-map (i.e., a vector field) from the reference samples Z0 to the target samples Zi by a barycentric projection of the optimal transport plan. Adding samples in the embedding space corresponds to adding their vector fields, which can be used to calculate (b) the mean distribution in the embedding space, i.e., φ -1 ( 1

Figure 7: Comparing the performance of the linear programming (LP) solver with the Sinkhorn algorithm (on entropy regularized OT problem) on the ogbg-molhiv dataset for various regularization parameters.

We use the following set of hyperparameters to perform a grid search over in each of the experiments:• Random Forest: min_samples_leaf ∈ {1, 2, 5}, min_samples_split ∈ {2,5, 10}, and n_estimators ∈ {25, 50, 100, 150, 200}. • Gradient Boosted Decision Tree (GBDT): min_samples_leaf ∈ {1, 2, 5}, min_samples_split ∈ {2, 5, 10}, n_estimators ∈ {25, 50, 100, 150, 200}, and max_depth ∈ {1, 3, 5}. • SVM-Linear and SVM-RBF: C ∈ {10 -2 , . . . , 10 5 }. • Multi-Layer Perceptron (MLP): hidden_layer_sizes ∈ {(128), (256), (128, 64), (256, 128)}. • Auto-ML: Auto-Sklearn 2.0 searches over a space of 42 hyperparameters using Bayesian optimization techniques, as mentioned in Feurer et al. (2020). • Number of Diffusion Layers in equation 7 and equation 14: L ∈ {3, . . . , 8}. Classifier + Embedding Type IMDB-B IMDB-Graph classification accuracy (%) of the Wasserstein embedding based on different classifiers and the three types of node embeddings, on various TUD graph classification tasks.

Graph classification results on the ogbg-molhiv dataset. The results for GCN and GIN are reported from(Hu et al., 2020). The best validation and test results are shown in bold.

Graph classification accuracy (%) of our method and comparison with the state-of-the-art GNNs and graph kernels (GKs) on various TUD graph classification tasks. The results for DGCNN are reported from(Errica et al., 2020). The top-three performers on each dataset are shown in bold.

Figure6: ROC-AUC (%) results on ogbg-molhiv dataset, when the reference distribution is calculated by k-means (Section 4.2) on the training dataset (denoted as k-means), compared to when it is fixed to be a normal distribution (denoted as Normal). With a p-value= 0.05, the choice of the template is statistically insignificant.

and the sequential application of GNN layers, as in equation 12, computes intermediate feature vectors {x

ACKNOWLEDGEMENT

We gratefully acknowledge funding by the United States Air Force under Contract No. FA8750-19-C-0098. Gustavo K. Rohde also acknowledges funding by NIH grant GM130825. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA.

annex

• Initial Node Feature Dimensionality (for ogbg-molhiv only): {100, 300, 500}.• Node Embedding Type: For a graph with F -dimensional initial node features, we consider using the following three types of node embedding:-Concat:, where denotes concatenation.-Average:

A.6 TUD BENCHMARK -COMPLETE RESULTS

Here, we report the comprehensive set of classification results of our proposed method, WEGL, for each of the node embedding types mentioned in Section A.5, using five different classifiers: Linear SVM, Kernel-SVM (SVM-RBF), Gradient Boosted Decision Trees (GBDT), Multi-Layer Perceptron (MLP), and Random Forest (RF). The results are shown in Table 3 .

