GRAPH FOURIER MMD FOR SIGNALS ON GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

While numerous methods have been proposed for computing distances between probability distributions in Euclidean space, relatively little attention has been given to computing such distances for distributions on graphs. However, there has been a marked increase in data that either lies on a graph (such as protein interaction networks) or can be modeled as a graph (such as single cell data), particularly in the biomedical sciences. It is therefore important to find ways to compare signals defined on such graphs. Here, we propose Graph Fourier MMD (GFMMD), a novel distance between distributions and signals on graphs. GFMMD is defined via an optimal witness function that is both smooth on the graph and maximizes the difference in expectation between the pair of distributions on the graph. We find an analytical solution to this optimization problem as well as an embedding of distributions that results from this method. We also prove several properties of this method, including scale invariance and applicability to disconnected graphs. We showcase it on graph benchmark datasets as well as on single cell RNA-sequencing data analysis. In the latter, we use the GFMMD-based gene embeddings to find meaningful gene clusters. We also propose a novel type of score for gene selection, called the gene localization score, which helps select genes for cellular state space characterization.

1. INTRODUCTION

With the advent of high dimensional, high throughput data in fields ranging from biology to finance to physics, it becomes important to develop methods for performing high dimensional statistics on such data. In particular, the analysis of signals (or features in the data) and their pattern of spread through the landscape of data poses a challenge. If the signals act as functions on a low dimensional space like $\mathbb{R}^2$ or $\mathbb{R}^3$, it is possible to visualize them to gain a sense of the data. But what if these signals act on a higher dimensional space like $\mathbb{R}^{9000}$? This could easily be the case in practice when a set of observations carries many variables, as in single-cell data. To handle data like this, one useful assumption has been that the data lies intrinsically on a lower-dimensional manifold $\mathcal{M}$, i.e., the manifold hypothesis. This hypothesis has motivated low-dimensional embedding algorithms such as spectral clustering (Ng et al., 2001), tSNE (van der Maaten & Hinton, 2008), diffusion maps (Coifman & Lafon, 2006), and PHATE (Moon et al., 2019). In such algorithms, the data is first converted to an affinity graph by computing distances between data points and then applying a kernel function to those distances to obtain affinities. This allows us to represent high dimensional data in a simpler form. Here, we use this representation to propose a distance between distributions or signals on such high dimensional data graphs, called Graph Fourier Maximum Mean Discrepancy (GFMMD). Note that GFMMD can also work on signals that naturally arise from graph or network structures, e.g., features of people in social interaction graphs. Thus far, while graph neural networks (GNNs) and other methods have focused on organizing and classifying nodes, there has been little focus on organizing the variables or signals themselves on these abstract spaces.
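The affinity-graph construction described above (pairwise distances first, then a kernel applied to those distances) can be sketched as follows. This is a minimal illustration, not the construction used in the paper: the Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_affinity` and `graph_laplacian` are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Dense affinity matrix for points X (n x d).

    Assumption: a Gaussian kernel with bandwidth `sigma`; other kernels
    (e.g. k-nearest-neighbor or adaptive kernels) are equally possible.
    """
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off
    # Kernel turns distances into affinities in (0, 1].
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def graph_laplacian(W):
    """Combinatorial Laplacian L = D - W of a weighted graph."""
    return np.diag(W.sum(axis=1)) - W

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = gaussian_affinity(X, sigma=1.0)
L = graph_laplacian(W)
```

A dense matrix is used here only for readability; for large datasets a sparse nearest-neighbor graph would be the more practical choice.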
For instance, for many measurement or sensor systems whose structure can be modelled as a graph, there is a need to understand the relationship between measured features. One example that arises in biology is that of single cell data. Here the cellular manifold can be modelled as a nearest neighbors graph, and each cell has measurements of thousands of genes. There is a great deal of interest in understanding the relationships between genes and how and whether their expression is localized to parts of the cellular manifold. Here we address the question of how to organize and compare signals on graphs in a way that accounts for the geometric structure of their underlying space. In particular, given a weighted graph $G = (V, E, w)$ and a set of functions $\{f_i\}_i$ on the vertices, $f_i : V \to \mathbb{R}$, how can we structure and analyze these signals? We will first consider the case when $f_i$ is a probability mass function and then extend the framework to arbitrary signals. This has very natural applications to many modern datasets. To illustrate this, we focus on the application of embedding a set of genes on a graph of cells, as created from single cell RNA-sequencing data, and on measuring whether the expression of a gene is localized (i.e., characteristic of a subpopulation of cells) or global, like a house-keeping gene. We present a new distance that belongs to the family of integral probability metrics (Sriperumbudur et al., 2012). Integral Probability Metrics (IPMs) are distances between probability distributions that are characterized by a witness function that maximizes the discrepancy between distributions in expectation. The Maximum Mean Discrepancy (MMD) distances (Gretton et al., 2012) are a popular class of IPMs; they assume further structure on the space of witness functions, requiring that they come from a reproducing kernel Hilbert space.
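For reference, the standard forms of the two notions just described can be written as follows. The notation is an assumption made for this display: $\mathcal{F}$ denotes the class of witness functions, $\mathcal{H}$ a reproducing kernel Hilbert space with kernel $k$, and $\mu_P$ the kernel mean embedding of $P$. These are the generic definitions, not the graph-specific GFMMD.

```latex
% Integral probability metric over a witness class F
d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}}
  \left| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right|

% MMD: the witness class is the unit ball of an RKHS H
\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \le 1}
  \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)
  = \left\| \mu_P - \mu_Q \right\|_{\mathcal{H}},
\qquad \mu_P = \mathbb{E}_{x \sim P}[k(x, \cdot)]
```

The second equality is what makes MMD computable in closed form: the supremum over witness functions reduces to a norm between mean embeddings.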
Our notion of MMD, which we call Graph Fourier MMD (GFMMD), is a distance between signals on a data graph that is found by analytically solving for an optimal witness function. Furthermore, through the use of Chebyshev polynomials (Mason & Handscomb, 2002), GFMMD can be computed rapidly, and it has a closed-form solution. We demonstrate its potential on toy datasets as well as single cell data, where we use it to identify gene modules. Our main contributions are as follows:
1) We define Graph Fourier MMD as a distance between signals on arbitrary graphs, and prove that it is both an integral probability metric and a maximum mean discrepancy.
2) We derive an exact analytical solution for GFMMD, which can be approximated in $O(n(\log n + m^2))$ time to calculate all pairwise distances between distributions, where $n$ is the number of vertices of the graph and $m$ is the number of signals.
3) We derive a feature map for GFMMD that allows for efficient embeddings and dimensionality reduction.
4) We provide an efficient Chebyshev approximation method for computing GFMMD among a set of signals.
5) We showcase the application of GFMMD to single cell RNA-sequencing data.
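To give a sense of how Chebyshev polynomials avoid an eigendecomposition, the sketch below approximates the action of a spectral filter $h(L)$ on a signal using only matrix-vector products with the Laplacian $L$. It is a generic illustration under stated assumptions: the function name `chebyshev_apply` is hypothetical, the heat kernel $h(\lambda) = e^{-\lambda}$ in the usage lines is a stand-in (this section does not specify the filter GFMMD uses), and `lmax` is any upper bound on the largest eigenvalue of $L$.

```python
import numpy as np

def chebyshev_apply(L, x, h, lmax, order=30):
    """Approximate h(L) @ x for a symmetric PSD matrix L with spectrum in [0, lmax].

    Uses a truncated Chebyshev expansion of h on [0, lmax], so only
    products L @ v are needed (no eigendecomposition). Requires order >= 1.
    """
    n_terms = order + 1
    # Chebyshev nodes on [-1, 1], mapped to [0, lmax].
    theta = np.pi * (np.arange(n_terms) + 0.5) / n_terms
    lam = 0.5 * lmax * (np.cos(theta) + 1.0)
    # Expansion coefficients: h(y) ~ c[0]/2 + sum_{k>=1} c[k] T_k(y).
    c = np.array([(2.0 / n_terms) * np.sum(h(lam) * np.cos(k * theta))
                  for k in range(n_terms)])

    def shifted(v):
        # Apply (2/lmax) L - I, which maps the spectrum into [-1, 1].
        return (2.0 / lmax) * (L @ v) - v

    t_prev = x                 # T_0 applied to x
    t_curr = shifted(x)        # T_1 applied to x
    out = 0.5 * c[0] * t_prev + c[1] * t_curr
    for k in range(2, n_terms):
        # Three-term recurrence T_k = 2 y T_{k-1} - T_{k-2}.
        t_next = 2.0 * shifted(t_curr) - t_prev
        out = out + c[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Hypothetical usage: heat-kernel filter on a 6-node path graph.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
x = np.arange(n, dtype=float)
# For a combinatorial Laplacian, twice the maximum degree bounds the spectrum.
y = chebyshev_apply(L, x, lambda lam: np.exp(-lam), lmax=2.0 * W.sum(axis=1).max())
```

With a sparse $L$, each term of the recurrence costs one sparse matrix-vector product, which is where the speed-up over a full eigendecomposition comes from.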

1.1. RELATED WORK

Spectral methods (Coifman & Lafon, 2006; Belkin & Niyogi, 2003; Bronstein & Bronstein, 2010) define an embedding of the nodes of a graph using the eigendecomposition of a graph operator (the Laplacian or a diffusion operator). Similar to these methods, we use the graph's spectral properties to define an embedding of signals on the graph, and we show that this embedding preserves an MMD distance between signals. The closest related work is Diffusion EMD (Tong et al., 2021), which diffuses graph signals to different scales using a diffusion operator (similar to that of a diffusion map (Coifman & Lafon, 2006)) to create multiscale density estimates of the data. Diffusion EMD then computes a weighted $L^1$ distance between the multiscale density estimates of different signals. While this method is faster than most primal methods for EMD computation, it can be inaccurate unless the graph is sufficiently large. Earlier methods proposed for empirical estimation of high dimensional EMD include the Sinkhorn method (Cuturi, 2013), which applies Sinkhorn iterations (repeated normalization) to a joint probability distribution until it converges to a distribution that describes a valid transport plan, i.e., one whose marginals agree with the two empirical distributions. Solomon et al. (2015) and Huguet et al. (2022) extend the Sinkhorn algorithm to graphs with a heat-geodesic ground distance. Their algorithm can be computed efficiently for two signals on a sparse graph, but it does not provide an embedding of signals. Le et al. (2022; 2019) and Essid & Solomon (2018) consider the EMD between distributions defined on a distance graph, that is, a graph whose edge weights define the cost of moving mass from one node to another. Le et al. (2022; 2019) provide a closed-form solution that relies on a graph shortest path distance.
In this setting, there is no sparse approximation to diffusion distances in terms of graph shortest paths. We consider a different problem, in which the edges of the graph are affinities.
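The Sinkhorn iterations mentioned above (repeated normalization of a joint distribution until its marginals match the two inputs) can be sketched as follows. This is a minimal dense version for illustration only: the function name `sinkhorn_plan`, the regularization strength `eps`, and the fixed iteration count are assumptions, and this is neither the graph-based variant of Solomon et al. (2015) nor the method proposed in this paper.

```python
import numpy as np

def sinkhorn_plan(p, q, C, eps=0.5, n_iter=500):
    """Entropy-regularized transport plan between histograms p and q.

    p, q : strictly positive 1-D arrays that each sum to 1.
    C    : cost matrix, C[i, j] = cost of moving mass from i to j.
    eps  : entropic regularization strength (an assumed default).
    """
    K = np.exp(-C / eps)          # Gibbs kernel derived from the cost
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)           # rescale rows to match p
        v = q / (K.T @ u)         # rescale columns to match q
    plan = u[:, None] * K * v[None, :]
    return plan, float(np.sum(plan * C))

# Hypothetical usage: three bins on a line with |i - j| ground cost.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
idx = np.arange(3)
C = np.abs(idx[:, None] - idx[None, :]).astype(float)
plan, cost = sinkhorn_plan(p, q, C)
```

Smaller values of `eps` bring the regularized cost closer to the unregularized EMD but slow convergence and can cause numerical underflow in `K`; log-domain implementations are commonly used in that regime.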

