LEARNING TO MINE APPROXIMATE NETWORK MOTIFS

Abstract

Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many datasets. However, strong combinatorial bottlenecks have made it difficult to extract motifs and use them in learning tasks without strict constraints on the motif properties. In this work, we propose MotiFiesta, a representation learning method based on learnable graph coarsening, which is the first able to extract large and approximate motifs in a fully differentiable manner. We build benchmark datasets and evaluation metrics which test the ability of our proposed and future models to capture different aspects of motif discovery where ground truth motifs are not known. Finally, we explore the notion of exploiting learned motifs as an inductive bias in real-world datasets by showing competitive performance of motif-based feature sets on established real-world benchmark datasets against concurrent architectures.

1. INTRODUCTION

In many application domains, observing an over-represented substructure in a dataset is seen as evidence for its importance in network function. For example, early studies on network motifs enumerated all possible over-represented small subgraphs across various datasets and uncovered structures which explain the behaviour of real-world systems, such as the feedforward loop in gene regulatory networks and the bi-parallel motif in ecological food chains (Milo et al., 2002). More recently, motif libraries have shown strong utility in many machine learning contexts (Jin et al., 2020), including representation learning (Bevilacqua et al., 2021; Cotta et al., 2021; Rossi et al., 2020), classification (Zhang et al., 2020; Acosta-Mendoza et al., 2012; Thiede et al., 2021; Besta et al., 2022), and explainability (Perotti et al., 2022). Although exhaustively mining motifs is known to be NP-hard (Yu et al., 2020), motif utility has made the discovery task a key challenge in data mining for the past 30 years. The aim of this work is to formally expose the task of motif mining to machine learning models in an effort to enhance both the discovery of motifs and the representation power of learned models.

Any motif mining algorithm has to solve two computationally intensive steps: subgraph search and graph matching. The process of discovering a new occurrence of the motif involves a search over the set of subgraphs of a given dataset, which yields a search space that grows exponentially with the number of nodes in the dataset as well as in the motif. Next, for a candidate motif and subgraph, a graph matching procedure is needed to determine whether the candidate can be included in the set of instances of the motif. Despite these barriers, many motif mining tools have been proposed (Nijssen & Kok, 2004; Yan & Han, 2002; Wernicke, 2006), all of which rely on simplifications of the task or a priori assumptions about the desired motifs. These simplifications include bounding motif size (Alon et al., 2008), topology constraints (Reinharz et al., 2018), and simplified subgraph matching criteria such as strict isomorphism, among others.

Besides those, an important limitation that is often overlooked is the variability inherent to many motif sets, particularly in biological networks. Network datasets often represent dynamic or noisy processes, and algorithms which limit their matching procedure to exact isomorphism will overlook a large set of possible motif candidates. Some tools have addressed this challenge, again with strong limitations, such as REAFUM (Li & Wang, 2015), which allows for errors in node labelling, and RAM (Zhang & Yang, 2008), which tolerates a fixed number of edge deletions within a given

