OPTIMAL TRANSPORT-BASED SUPERVISED GRAPH SUMMARIZATION

Abstract

Graph summarization is the problem of producing smaller graph representations of an input graph dataset, in such a way that the smaller compressed graphs capture relevant structural information for downstream tasks. One graph summarization method, recently proposed in Garg & Jaakkola (2019), formulates an optimal transport-based framework that allows prior information about node, edge, and attribute importance to be incorporated into the graph summarization process. We consider the problem of graph summarization in a supervised setting, wherein we seek to preserve relevant information about a class label. We first formulate this problem in terms of maximizing the Shannon mutual information between the summarized graph and the class label. We propose a method that incorporates mutual information estimates between random variables associated with sample graphs and class labels into the optimal transport compression framework. We empirically show performance improvements over previous works in terms of classification accuracy and time on synthetic and certain real datasets. We also theoretically explore the limitations of the optimal transport approach for the supervised summarization problem and we show that it fails to satisfy a certain desirable information monotonicity property.

1. INTRODUCTION

Machine learning involving graphs has a wide range of applications in artificial intelligence Scarselli et al. (2008); Dessì et al. (2020), network analysis, and biological interactions Han et al. (2019); Chen et al. (2020). Graph classification problems use the network structure of the underlying data to improve predictive decision outcomes. However, graph datasets are often enormous, and the algorithms used to extract relevant information from graphs are frequently computationally expensive. Graph summarization addresses these scalability issues by computing reduced representations of graph datasets while retaining relevant information. As with numerous other problems in machine learning, the precise meaning of "reduced representation" does not have a single mathematical definition, and there is no single objective function being optimized; there are thus various approaches to this problem. For a survey, see Liu et al. (2018). The particular type of approach of interest in this paper takes a dataset of graphs and a number k as input and outputs, for each graph G in the dataset, a subgraph H ⊆ G induced by k vertices. Optimal transport, the general problem of moving one distribution of mass to another as efficiently as possible, has been used in many recent graph-related problems, such as graph matching via the Gromov-Wasserstein distance Xu et al. (2019). One recent approach to the graph summarization problem that allows for the incorporation of user-engineered prior information is the Optimal Transport based Compression (OTC) approach of Garg & Jaakkola (2019). Their approach is as follows: a graph G, a target number k of vertices, a probability distribution ρ0 on the vertices of G, and a cost function c : E(G) → R, where E(G) denotes the set of edges of G, are given as input.
A probability distribution ρ1 is computed by minimizing the Wasserstein distance on G between ρ0 and ρ1 with respect to the cost function c, subject to the constraint that the support of ρ1 contains at most k vertices. The output subgraph H is the one induced by the vertices in the support of ρ1. Prior information can be incorporated into the method by appropriately choosing ρ0 and c, but in the prior work this "prior information" is not learned: ρ0 and c are set heuristically. In the present work, we propose a novel supervised summarization algorithm based on optimal transport that estimates principled values for these parameters from the input data. We show that it empirically surpasses state-of-the-art performance (including that of the specific method proposed in OTC) on selected real and synthetic datasets. The novelty of the summarization algorithm is that we set the optimal transport parameters in terms of the information that node attributes and edge indicators carry about the class variable. Along the way, we develop an estimator for the mutual information between a latent position vector graph and a class label. This extends the well-studied problem of information-theoretic measure estimation Kraskov et al. (2004); Moon et al. (2017); Noshad et al. (2017). Our estimator is inspired by EDGE's fast and relatively accurate implementation Noshad et al. (2019). Toward providing a theory for the limitations of our approach to graph summarization, we propose a natural information-theoretic Cover & Thomas (1991); Yasaei Sekeh et al. (2018) objective function for the task of supervised graph summarization and show that it is NP-hard to optimize. Using this new framework, we explore the limitations of the optimal-transport-based approach, both theoretically and empirically. Specifically, we formulate a notion of information monotonicity of an optimal transport parameter pair with respect to a data distribution: the desirable property that the flow cost decreases monotonically as the mutual information of the resulting summarized graph data with the corresponding class labels increases. When this property holds, optimizing the flow cost increases class-label information (though it need not maximize it). We show that any optimal transport parameter pair satisfying natural properties fails to exhibit information monotonicity for at least some data distributions.

Contribution: We summarize our contributions in this paper as follows. We propose a novel information-theoretic attributed graph summarization problem formulation in which the goal is to choose a graph summary that maximizes the mutual information between sample graphs and class labels. We prove that maximizing the mutual information between attributed graphs and class variables, even with knowledge of the data distribution, is NP-hard, even to approximate. We then show, via explicit constructions, some limitations of the optimal transport graph compression approach of Garg & Jaakkola (2019) (OTC). Finally, we introduce a supervised graph summarization framework based on optimal transport and experimentally show that it outperforms both the baseline (no compression) and the unsupervised OTC method in terms of post-compression classification time and test accuracy, despite being subject to the fundamental limitations identified in our theoretical contribution.
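To make the OTC setup concrete, the following is a minimal brute-force sketch, not Garg & Jaakkola's algorithm. It relies on a standard fact about metric costs: the minimum Wasserstein-1 cost of transporting ρ0 onto some distribution supported on a fixed vertex set S is achieved by sending each vertex's mass to its nearest vertex of S, so choosing the best size-k support reduces to a k-median-style search. All names here (e.g. otc_summary) are our own, and the exhaustive subset search is only feasible for tiny graphs.

```python
from itertools import combinations

def shortest_path_costs(nodes, edge_cost):
    """All-pairs shortest-path costs (Floyd-Warshall) induced by the
    edge cost function c; edges are treated as undirected."""
    INF = float("inf")
    d = {(u, v): (0.0 if u == v
                  else edge_cost.get((u, v), edge_cost.get((v, u), INF)))
         for u in nodes for v in nodes}
    for w in nodes:
        for u in nodes:
            for v in nodes:
                if d[u, w] + d[w, v] < d[u, v]:
                    d[u, v] = d[u, w] + d[w, v]
    return d

def otc_summary(nodes, edge_cost, rho0, k):
    """Brute-force sketch: pick the k-vertex support S minimizing the
    W1 cost of moving rho0 onto a distribution supported on S.  On a
    metric graph this cost is sum_v rho0[v] * min_{s in S} d(v, s)."""
    d = shortest_path_costs(nodes, edge_cost)
    best_S, best_cost = None, float("inf")
    for S in combinations(nodes, k):
        cost = sum(p * min(d[v, s] for s in S) for v, p in rho0.items())
        if cost < best_cost:
            best_S, best_cost = set(S), cost
    return best_S, best_cost

# Toy example: a 4-vertex path graph with mass concentrated at the endpoints;
# the best 2-vertex support keeps both endpoints.
nodes = [0, 1, 2, 3]
costs = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
rho0 = {0: 0.4, 1: 0.1, 2: 0.1, 3: 0.4}
support, flow_cost = otc_summary(nodes, costs, rho0, k=2)
```

In this toy instance the summarizer drops the low-mass interior vertices, illustrating how the choice of ρ0 steers which vertices survive compression.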

2. PROBLEM FORMULATION: SUPERVISED GRAPH SUMMARIZATION

We formulate the supervised graph summarization problem as follows. We fix a target compression ratio κ ∈ (0, 1), which will be the ratio of the number of vertices in a summarized graph to the number in the original graph. A probability distribution D over tuples (G, X, C) is fixed by nature and unknown to both the summarizer and the classifier. Here, the graphs G are defined on a single common set V of vertices, X is a |V| × d matrix whose rows are d-dimensional feature vectors corresponding to the vertices of G, and C is a class label coming from some fixed set C. A dataset {(G_i, X_i, C_i)}_{i=1}^m ~ D^m consisting of m independent and identically distributed (iid) samples from D is presented to us. Our task is, given the sample {(G_i, X_i, C_i)}_{i=1}^m but no knowledge of the distribution D, to select a subset H ⊆ V of vertices satisfying

H = arg max_{U ⊆ V, |U| ≤ κ|V|} I((G_U, X_U); C).    (1)

Here, G_U is the subgraph of G induced by the vertices in U, and X_U is the matrix of corresponding feature vectors. The standard definition of the function I(·; ·) is the Shannon mutual information Cover & Thomas (2006), defined as follows: for random variables X and Y on a common probability space,

I(X; Y) = E_{P(X,Y)} [ log ( P(X,Y) / ( P(X) P(Y) ) ) ],

where the expectation is taken with respect to the joint distribution of X and Y. The objective function in (1), i.e., the mutual information between an attributed graph and its class label, is defined as follows. An attributed graph consists of both a graph and a collection of node features. Let g : (0, ∞) → R be a convex function with g(1) = 0. Given a graph G_V with fixed vertex set V = {v_1, . . . , v_k} and feature set X_V = {X_{v_1}, . . . , X_{v_k}}, the mutual information between (G_V, X_V) and the class variable
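As a toy illustration of the selection criterion in (1), the following sketch estimates the Shannon mutual information with a simple plug-in (empirical-frequency) estimator over discrete sample graphs and searches candidate vertex subsets by brute force. This is our own illustrative code, not the estimator developed in the paper (which is inspired by EDGE); all names are hypothetical, and the exhaustive search is only feasible for very small vertex sets.

```python
from collections import Counter
from itertools import combinations
from math import log

def plugin_mi(pairs):
    """Plug-in estimate of I(Z; C) from (z, c) samples:
    sum over (z, c) of p(z, c) * log( p(z, c) / (p(z) p(c)) )."""
    n = len(pairs)
    pzc = Counter(pairs)
    pz = Counter(z for z, _ in pairs)
    pc = Counter(c for _, c in pairs)
    return sum((m / n) * log((m / n) / ((pz[z] / n) * (pc[c] / n)))
               for (z, c), m in pzc.items())

def summarize(samples, V, kappa):
    """Pick U with |U| <= kappa * |V| maximizing the plug-in estimate
    of I((G_U, X_U); C).  Each sample is (edges, features, label),
    where edges is a set of vertex pairs and features maps vertex -> value."""
    k = max(1, int(kappa * len(V)))
    best_U, best_mi = None, -1.0
    for size in range(1, k + 1):
        for U in combinations(sorted(V), size):
            pairs = []
            for edges, feats, c in samples:
                # Restrict the graph and features to the candidate subset U.
                g_u = frozenset(e for e in edges if e[0] in U and e[1] in U)
                x_u = tuple(feats[v] for v in U)
                pairs.append(((g_u, x_u), c))
            mi = plugin_mi(pairs)
            if mi > best_mi:
                best_U, best_mi = set(U), mi
    return best_U, best_mi

# Toy data: the feature of vertex "a" determines the label, while "b" is noise,
# so with kappa = 0.5 the summarizer should keep {"a"}.
samples = [
    (set(), {"a": 0, "b": 0}, 0),
    (set(), {"a": 0, "b": 1}, 0),
    (set(), {"a": 1, "b": 0}, 1),
    (set(), {"a": 1, "b": 1}, 1),
]
best_U, best_mi = summarize(samples, {"a", "b"}, 0.5)
```

The plug-in estimator is consistent only for discrete data with many samples; the NP-hardness result above is precisely the reason the exhaustive subset search here cannot scale, motivating the optimal-transport relaxation.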

