LEARNING TO MINE APPROXIMATE NETWORK MOTIFS

Abstract

Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many datasets. However, strong combinatorial bottlenecks have made it difficult to extract motifs and use them in learning tasks without imposing strong constraints on motif properties. In this work we propose MotiFiesta, a representation learning method based on learnable graph coarsening which is the first to extract large and approximate motifs in a fully differentiable manner. We build benchmark datasets and evaluation metrics which test the ability of our proposed model, and of future models, to capture different aspects of motif discovery when ground-truth motifs are not known. Finally, we explore the use of learned motifs as an inductive bias on real-world datasets, showing that motif-based feature sets achieve competitive performance against contemporary architectures on established real-world benchmarks.

1. INTRODUCTION

In many application domains, observing an over-represented substructure in a dataset is taken as evidence of its importance for network function. For example, early studies on network motifs enumerated all over-represented small subgraphs across various datasets and uncovered structures that explain the behaviour of real-world systems, such as the feed-forward loop in gene regulatory networks and the bi-parallel motif in ecological food chains (Milo et al., 2002). More recently, motif libraries have shown strong utility in many machine learning contexts such as generation (Jin et al., 2020), classification (Zhang et al., 2020; Acosta-Mendoza et al., 2012; Thiede et al., 2021; Besta et al., 2022), representation learning (Bevilacqua et al., 2021; Cotta et al., 2021; Rossi et al., 2020), and explainability (Perotti et al., 2022). Although exhaustively mining motifs is known to be NP-hard (Yu et al., 2020), the utility of motifs has made the discovery task a key challenge in data mining for the past 30 years. The aim of this work is to formally expose the task of motif mining to machine learning models, in an effort to enhance both the discovery of motifs and the representational power of learned models.

Any motif mining algorithm has to solve two computationally intensive steps: subgraph search and graph matching. Discovering a new occurrence of a motif involves a search over the set of subgraphs of a given dataset, a search space that grows exponentially with the number of nodes in the dataset as well as in the motif. Next, for a candidate motif and subgraph, a graph matching procedure is needed to determine whether the candidate can be included in the set of instances of the motif. Despite these barriers, many motif mining tools have been proposed (Nijssen & Kok, 2004; Yan & Han, 2002; Wernicke, 2006), all of which rely on simplifications of the task or a priori assumptions about the desired motifs.
These simplifications include bounding the motif size (Alon et al., 2008), constraining the topology (Reinharz et al., 2018), and simplifying the subgraph matching criterion, e.g. to strict isomorphism, among others. Beyond these, an important limitation that is often overlooked is the variability inherent to many motif sets, particularly in biological networks. Network datasets often represent dynamic or noisy processes, and algorithms which limit their matching procedure to exact isomorphism will overlook a large set of possible motif candidates. Some tools have addressed this challenge, again with strong limitations, such as REAFUM (Li & Wang, 2015), which allows for errors in node labelling, and RAM (Zhang & Yang, 2008), which tolerates a fixed number of edge deletions within a given motif. All of these constraints limit the set of observable motifs and can cause us to miss important features of datasets. For this reason, we emphasize that our proposed methodology is built to support the discovery of approximate motifs.

The recent success of graph representation learning, particularly in unsupervised settings, presents an opportunity to circumvent some of these bottlenecks (Karalias & Loukas, 2020). Namely, by allowing graph embedding models to leverage the statistical properties of a dataset, we can cast the search and matching problems as efficient operations such as real-valued vector distances. Of course, as is common with neural methods, we sacrifice convergence and exactness guarantees in exchange for flexibility and speed. In this regard, there has been extensive work on problems related to motif mining using neural architectures, such as subgraph counting (Teixeira et al., 2022; Chen et al., 2020; Liu et al., 2020) and graph matching (Li et al., 2019; Fey et al., 2020). To our knowledge, neural motif mining for approximate subgraphs has yet to be proposed.
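As a toy illustration of swapping exact isomorphism for a tolerant similarity, one can compare Weisfeiler-Lehman label multisets, which degrade gracefully under small perturbations instead of failing outright. This sketch is ours, not a method from this work: the function names, the uniform initial labels, and the choice of a Jaccard score are all illustrative assumptions.

```python
from collections import Counter

def wl_labels(adj, rounds=2):
    """Collect Weisfeiler-Lehman labels: each round, a node's new label
    hashes its old label together with the sorted labels of its neighbours."""
    labels = {v: 0 for v in adj}            # uniform initial labels
    seen = Counter(labels.values())
    for _ in range(rounds):
        labels = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in adj}
        seen.update(labels.values())
    return seen

def wl_similarity(adj_a, adj_b, rounds=2):
    """Jaccard similarity of the two graphs' WL label multisets:
    a cheap, tolerant stand-in for exact isomorphism testing."""
    a, b = wl_labels(adj_a, rounds), wl_labels(adj_b, rounds)
    return sum((a & b).values()) / sum((a | b).values())

# A triangle and a 3-node path, as adjacency lists.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
```

Identical graphs score 1.0, while the triangle and the path score strictly between 0 and 1: they share their initial labels but diverge once neighbourhoods are refined, which is exactly the graded behaviour an approximate matcher needs.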
Related work has shown promise on similar tasks: motif mining in domain-specific and partially differentiable settings (Oliver et al., 2022), and the related problems of frequent subgraph mining (Ying et al., 2020) and discriminative subgraph mining (Zhang et al., 2020). Finally, the composability of differentiable models allows motif mining to act as a pre-training module, orienting classifiers towards robust and rich feature sets. For these reasons, we believe there is a need to formally introduce the motif mining problem to the ML community by providing appropriate benchmarking settings and proposing relevant methodology.

1.1. CONTRIBUTIONS

In this work, we (1) formalize the notion of motif mining as a machine learning task and provide appropriate evaluation metrics as well as benchmarking datasets, (2) propose MotiFiesta, a fully differentiable model, as a first solution to learnable motif mining which discovers new motifs in seconds on large datasets, and (3) show that motif mining can also serve as an effective unsupervised pre-training routine and interpretable feature selector on real-world datasets.

2.1. NETWORK MOTIF DEFINITION

We start from the classical definition of a motif as a subgraph that occurs with larger frequency than expected (Milo et al., 2002). More formally, let g = (V, E, X) be a connected subgraph drawn from a graph dataset G, where V is a set of nodes, E ⊆ V × V is a set of edges, and X ∈ R^{|V|×d} is a feature matrix. The frequency of subgraph g is given by:

f(g, G) = |{h ⊆ G : h ≃ g}|    (1)

where we count the number of subgraphs h of G isomorphic to g. The raw frequency of a subgraph leads to the task of frequent subgraph mining (Jiang et al., 2013), where we are interested in finding the set of subgraphs g with maximal frequency. However, to obtain significant motifs, the frequency must be normalized by the frequency of the subgraph in a null model that preserves the generic properties of the original network while ablating significantly enriched subgraphs (Milo et al., 2002). The null graphs give us a baseline expectation of subgraph occurrence and therefore point us toward significantly enriched subgraphs. Given a randomized (a.k.a. null) dataset G̃, g is considered a motif of G if the ratio f(g, G)/f(g, G̃) is sufficiently large, i.e. exceeds some threshold α. An approximate motif follows the same definition, but we simply replace the isomorphism condition with a graph similarity function such as a graph kernel (Vishwanathan et al., 2010; Kriege et al., 2020).
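To make these definitions concrete, here is a minimal brute-force sketch of the frequency in Equation 1 and of the null-model ratio. The function names are ours for illustration; this is not the paper's method, and practical mining tools exist precisely to avoid this exponential enumeration.

```python
from itertools import combinations, permutations

def count_occurrences(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """f(g, G): count node subsets of the graph whose induced subgraph
    is isomorphic to the pattern (brute force, exponential in |V|)."""
    g_edges = {frozenset(e) for e in graph_edges}
    k = len(pattern_nodes)
    count = 0
    for subset in combinations(graph_nodes, k):
        induced = {e for e in g_edges if e <= set(subset)}
        # Try every bijection pattern -> subset until one preserves all edges.
        for perm in permutations(subset):
            m = dict(zip(pattern_nodes, perm))
            if {frozenset({m[u], m[v]}) for u, v in pattern_edges} == induced:
                count += 1
                break
    return count

def motif_score(pattern_nodes, pattern_edges, real, null):
    """Ratio f(g, G) / f(g, G_null); a large value flags g as a motif."""
    f_real = count_occurrences(pattern_nodes, pattern_edges, *real)
    f_null = count_occurrences(pattern_nodes, pattern_edges, *null)
    return f_real / max(f_null, 1)

# Triangle pattern, counted in K4 (as the "real" graph) vs. a 4-node
# path standing in for a single randomized null graph.
triangle = ([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
k4 = ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
p4 = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
```

Here `count_occurrences(*triangle, *k4)` returns 4 (every 3-node subset of K4 induces a triangle) while the path contains none, so the triangle's score against this toy null is 4.0. A realistic null model would instead be obtained by degree-preserving randomization of the real graph.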




