WASSERSTEIN EMBEDDING FOR GRAPH LEARNING

Abstract

We present Wasserstein Embedding for Graph Learning (WEGL), a novel and fast framework for embedding entire graphs in a vector space, in which various machine learning models are applicable for graph-level prediction tasks. We leverage new insights on defining similarity between graphs as a function of the similarity between their node embedding distributions. Specifically, we use the Wasserstein distance to measure the dissimilarity between node embeddings of different graphs. Unlike prior work, we avoid pairwise calculation of distances between graphs and reduce the computational complexity from quadratic to linear in the number of graphs. WEGL calculates Monge maps from a reference distribution to each graph's node embedding distribution and, based on these maps, creates a fixed-sized vector representation of the graph. We evaluate our new graph embedding approach on various benchmark graph-property prediction tasks, showing state-of-the-art classification performance while having superior computational efficiency. The code is available at https://github.com/navid-naderi/WEGL.

1. INTRODUCTION

Many exciting and practical machine learning applications involve learning from graph-structured data. While images, videos, and temporal signals (e.g., audio or biometrics) are instances of data that are supported on grid-like structures, data in social networks, cyber-physical systems, communication networks, chemistry, and bioinformatics often live on irregular structures (Backstrom & Leskovec, 2011; Sadreazami et al., 2017; Jin et al., 2017; Agrawal et al., 2018; Naderializadeh et al., 2020). One can represent such data as (attributed) graphs, which are universal data structures. Efficient and generalizable learning from graph-structured data opens the door to a vast number of applications, which were beyond the reach of classic machine learning (ML) and, more specifically, deep learning (DL) algorithms.

Analyzing graph-structured data has received significant attention from the ML, network science, and signal processing communities over the past few years. On the one hand, there has been a rush toward extending the success of deep neural networks to graph-structured data, which has led to a variety of graph neural network (GNN) architectures. On the other hand, the research on kernel approaches (Gärtner et al., 2003), perhaps most notably the random walk kernel (Kashima et al., 2003) and the Weisfeiler-Lehman (WL) kernel (Shervashidze et al., 2011; Rieck et al., 2019; Morris et al., 2019; 2020), remains an active field of study, and the methods developed therein provide competitive performance in various graph representation tasks (see the recent survey by Kriege et al. (2020)). To learn graph representations, GNN-based frameworks make use of three generic modules, which provide i) feature aggregation, ii) graph pooling (i.e., readout), and iii) classification (Hu* et al., 2020). The feature aggregator provides a vector representation for each node of the graph, referred to as a node embedding.
The graph pooling module creates a representation for the graph from its node embeddings, whose dimensionality is fixed regardless of the underlying graph size, and which can then be analyzed using a downstream classifier of choice. On the graph kernel side, one leverages a kernel to measure the similarities between pairs of graphs, and uses conventional kernel methods to perform learning on a set of graphs (Hofmann et al., 2008). A recent example of such methods is the framework of Togninalli et al. (2019), in which the authors propose a novel node embedding inspired by the WL kernel, and combine the resulting node embeddings with the Wasserstein distance (Villani, 2008; Kolouri et al., 2017) to measure the dissimilarity between two graphs. Afterwards, they leverage conventional kernel methods based on the pairwise-measured dissimilarities to perform learning on graphs.

Considering the ever-increasing scale of graph datasets, which may contain tens of thousands of graphs or millions to billions of nodes per graph, the issue of scalability and algorithmic efficiency becomes of vital importance for graph learning methods (Hernandez & Brown, 2020; Hu et al., 2020). However, both of the aforementioned paradigms of GNNs and kernel methods suffer in this sense. On the GNN side, acceleration of the training procedure is challenging and scales poorly as the graph size grows (Bojchevski et al., 2019). On the graph kernel side, the need for calculating the matrix of all pairwise similarities can be a burden in datasets with a large number of graphs, especially if calculating the similarity between each pair of graphs is computationally expensive. For instance, in the method proposed by Togninalli et al. (2019), the computational complexity of each calculation of the Wasserstein distance is cubic in the number of nodes (or linearithmic for the entropy-regularized distance).
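To make the cost of a single pairwise comparison concrete, the sketch below (not the paper's code) computes the exact 2-Wasserstein distance between the node embeddings of two graphs via an optimal assignment. The function name `wasserstein2`, the random point clouds, and the simplifying assumption of equal node counts with uniform weights are all illustrative choices; the general case requires a linear program (or Sinkhorn iterations for the entropy-regularized variant).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein2(X, Y):
    """Exact 2-Wasserstein distance between two uniformly weighted
    point clouds X and Y of equal size (n x d arrays)."""
    C = cdist(X, Y, metric="sqeuclidean")  # pairwise squared costs
    row, col = linear_sum_assignment(C)    # optimal one-to-one matching
    return np.sqrt(C[row, col].mean())

rng = np.random.default_rng(0)
Xa = rng.normal(size=(50, 3))            # node embeddings of graph A
Xb = rng.normal(loc=2.0, size=(50, 3))   # node embeddings of graph B
d = wasserstein2(Xa, Xb)
```

For a dataset of M graphs, a kernel method would repeat this computation for every one of the M(M − 1)/2 pairs, which is exactly the bottleneck WEGL is designed to avoid.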
To overcome these issues, inspired by the linear optimal transport framework of Wang et al. (2013), we propose a linear Wasserstein Embedding for Graph Learning, which we refer to as WEGL. Our proposed approach embeds a graph into a Hilbert space, where the ℓ2 distance between two embedded graphs provides a true metric between the graphs that approximates their 2-Wasserstein distance. For a set of M graphs, the proposed method provides:

1. Reduced computational complexity of estimating the graph Wasserstein distance (Togninalli et al., 2019) for a dataset of M graphs, from quadratic complexity in the number of graphs, i.e., M(M − 1)/2 calculations of the Wasserstein distance, to linear complexity, i.e., M calculations; and

2. An explicit Hilbertian embedding for graphs, which is not restricted to kernel methods and can therefore be used in conjunction with any downstream classification framework.

We show that, compared to multiple GNN and graph kernel baselines, WEGL achieves either state-of-the-art or competitive results on benchmark graph-level classification tasks, including classical graph classification datasets (Kersting et al., 2020) and the recent molecular property-prediction benchmarks (Hu et al., 2020). We also compare the algorithmic efficiency of WEGL with two baseline GNN and graph kernel methods and demonstrate that it is much more computationally efficient relative to those algorithms.
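The core idea of the linear embedding can be sketched as follows: fix one reference point cloud, compute a single Monge map from the reference to each graph's node embeddings, and flatten the resulting displacements into a fixed-size vector, so that ℓ2 distances between vectors approximate pairwise 2-Wasserstein distances. In the illustrative sketch below, the helper name `lot_embed`, the equal-size clouds, and the uniform weights are simplifying assumptions; the full framework accommodates graphs of different sizes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def lot_embed(X, Z):
    """Linear Wasserstein embedding of node embeddings X (N x d) with
    respect to a fixed reference cloud Z (N x d), uniform weights.
    All graphs matched against the same Z yield vectors of the same
    length, which can be fed to any downstream classifier."""
    C = cdist(Z, X, metric="sqeuclidean")
    _, col = linear_sum_assignment(C)  # Monge map: Z[i] -> X[col[i]]
    return (X[col] - Z).ravel() / np.sqrt(len(Z))

rng = np.random.default_rng(1)
N, d = 40, 3
Z = rng.normal(size=(N, d))                # shared reference cloud
X1 = rng.normal(size=(N, d))               # node embeddings, graph 1
X2 = rng.normal(loc=1.5, size=(N, d))      # node embeddings, graph 2
phi1, phi2 = lot_embed(X1, Z), lot_embed(X2, Z)
approx_w2 = np.linalg.norm(phi1 - phi2)    # l2 distance in embedding space
```

Note the linear scaling: each of the M graphs requires exactly one transport computation against the reference, after which all pairwise comparisons are plain Euclidean distances.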

2. BACKGROUND AND RELATED WORK

In this section, we provide a brief background on different methods for deriving representations for graphs and an overview of Wasserstein distances, reviewing the related work in the literature.

2.1. GRAPH REPRESENTATION METHODS

Let G = (V, E) denote a graph, comprising a set of nodes V and a set of edges E ⊆ V × V, where two nodes u, v ∈ V are connected to each other if and only if (u, v) ∈ E.¹ For each node v ∈ V, we define its set of neighbors as N_v ≜ {u ∈ V : (u, v) ∈ E}. The nodes of the graph G may have categorical labels and/or continuous attribute vectors. We use a unified notation x_v ∈ ℝ^F to denote the label and/or attribute vector of node v ∈ V, where F denotes the node feature dimensionality. Moreover, we use w_uv ∈ ℝ^E to denote the edge feature vector for any edge (u, v) ∈ E, where E denotes the edge feature dimensionality. Node and edge features may be present depending on the graph dataset under consideration. To learn graph properties from the graph structure and its node/edge features, one can use a function ψ : 𝒢 → ℋ to map any graph G in the space of all possible graphs 𝒢 to an embedding ψ(G) in a Hilbert space ℋ. Kernel methods have been among the most popular ways of creating such graph embeddings. A graph kernel is defined as a function k : 𝒢 × 𝒢 → ℝ, where for two graphs G and G′, k(G, G′) represents the inner product of the embeddings ψ(G) and ψ(G′) over the Hilbert space ℋ. The mapping ψ could be explicit, as in graph convolutional neural networks, or implicit as in the case



¹ Note that this definition includes both directed and undirected graphs, where in the latter case, for each edge (u, v) ∈ E, the reverse edge (v, u) is also included in E.
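As a concrete (hypothetical) illustration of the notation above, the sketch below builds the neighbor sets N_v from an edge list under the undirected convention of the footnote; the function name and the toy path graph are our own, not part of the paper.

```python
from collections import defaultdict

def neighbor_sets(num_nodes, edges):
    """Return N_v = {u : (u, v) in E} for every node v, where each
    undirected edge (u, v) implicitly includes its reverse (v, u)."""
    N = defaultdict(set)
    for u, v in edges:
        N[u].add(v)  # edge (u, v)
        N[v].add(u)  # reverse edge (v, u), per the undirected convention
    return {v: N[v] for v in range(num_nodes)}

# A 4-node path graph 0 - 1 - 2 - 3
nbrs = neighbor_sets(4, [(0, 1), (1, 2), (2, 3)])
```

Node features x_v and edge features w_uv would then simply be arrays keyed by the same node and edge indices.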

