GRAPH FOURIER MMD FOR SIGNALS ON GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

While numerous methods have been proposed for computing distances between probability distributions in Euclidean space, relatively little attention has been given to computing such distances for distributions on graphs. At the same time, there has been a marked increase in data that either lies on a graph (such as protein interaction networks) or can be modeled as a graph (such as single-cell data), particularly in the biomedical sciences. It is therefore important to find ways to compare signals defined on such graphs. Here, we propose Graph Fourier MMD (GFMMD), a novel distance between distributions and signals on graphs. GFMMD is defined via an optimal witness function that is both smooth on the graph and maximizes the difference in expectation between the pair of distributions on the graph. We find an analytical solution to this optimization problem, as well as an embedding of distributions that results from this method. We also prove several properties of this method, including scale invariance and applicability to disconnected graphs. We showcase it on graph benchmark datasets as well as on single-cell RNA-sequencing data analysis. In the latter, we use the GFMMD-based gene embeddings to find meaningful gene clusters. We also propose a novel score for gene selection, called the gene localization score, which helps select genes for cellular state space characterization.

1. INTRODUCTION

With the advent of high-dimensional, high-throughput data in fields ranging from biology to finance to physics, it becomes important to develop methods for performing statistics on such data. In particular, analyzing signals (or features in the data) and their pattern of spread through the landscape of the data poses a challenge. If the signals act as functions on a low-dimensional space like $\mathbb{R}^2$ or $\mathbb{R}^3$, it is possible to visualize them to gain a sense of the data. But what if these signals act on a higher-dimensional space like $\mathbb{R}^{9000}$? This can easily be the case in practice when each observation carries many variables, as in single-cell data. To handle such data, one useful assumption has been that the data lies intrinsically on a lower-dimensional manifold $M$, i.e., the manifold hypothesis. This hypothesis has motivated low-dimensional embedding algorithms such as spectral clustering (Ng et al., 2001), tSNE (van der Maaten & Hinton, 2008), diffusion maps (Coifman & Lafon, 2006), and PHATE (Moon et al., 2019). In such algorithms, the data is first converted to an affinity graph by computing distances between data points and then applying a kernel function to those distances to obtain affinities. This allows us to represent high-dimensional data in a simpler form. Here, we use this representation to propose a distance between distributions or signals on such high-dimensional data graphs, called Graph Fourier Maximum Mean Discrepancy (GFMMD). Note that GFMMD can also work on signals that arise naturally from graph or network structures, e.g., features of people in social interaction graphs. Thus far, while GNNs and other methods have focused on organizing and classifying nodes, there has been little focus on organizing the variables or signals themselves on these abstract spaces.
For instance, when measurements or sensors have a structure that can be modeled as a graph, there is a need to understand the relationships between the measured features. One example from biology is single-cell data. Here the cellular manifold can be modeled as a nearest-neighbor graph in which each cell carries measurements of thousands of genes, and there is great interest in understanding the relationships between genes and whether, and how, their expression is localized to parts of the cellular manifold.
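The affinity-graph construction described above (pairwise distances followed by a kernel) can be sketched as follows. This is a minimal illustration, not the construction used in our experiments: the Gaussian kernel, the bandwidth `sigma`, the dense all-pairs graph, and the function name `affinity_graph` are all assumptions made for the example, and the data is synthetic.

```python
import numpy as np

def affinity_graph(X, sigma=1.0):
    """Dense affinity matrix for the rows of X.

    Assumption for illustration: a Gaussian kernel with bandwidth sigma.
    """
    # Squared Euclidean distance between every pair of data points.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off

    # The kernel maps distances to affinities: near points get values close to 1.
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

# Synthetic stand-in for high-dimensional data: 50 observations, 20 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
W = affinity_graph(X, sigma=5.0)

# Each column of X is then a signal on the graph: one value per node.
signal = X[:, 0]
```

In practice a sparse nearest-neighbor graph is often preferred over a dense kernel matrix for large datasets; the dense form is used here only to keep the sketch short.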

