ROBUST GRAPH DICTIONARY LEARNING

Abstract

Traditional Dictionary Learning (DL) aims to approximate data vectors as sparse linear combinations of basis elements (atoms) and is widely used in machine learning, computer vision, and signal processing. To extend DL to graphs, Vincent-Cuaz et al. (2021) proposed a method, called GDL, which describes the topology of each graph with a pairwise relation matrix (PRM) and compares PRMs via the Gromov-Wasserstein Discrepancy (GWD). However, GWD is sensitive to structural noise in graphs, and this lack of robustness often excludes GDL from a variety of real-world applications. This paper proposes an improved graph dictionary learning algorithm based on a robust Gromov-Wasserstein discrepancy (RGWD), which has theoretically sound properties and an efficient numerical scheme. Based on this discrepancy, our dictionary learning algorithm can learn atoms from noisy graph data. Experimental results demonstrate that our algorithm achieves good performance on both simulated and real-world datasets.

1. INTRODUCTION

Dictionary learning (DL) seeks to learn a set of basis elements (atoms) from data and approximates data samples by sparse linear combinations of these basis elements (Mallat, 1999; Mairal et al., 2009; Tošić and Frossard, 2011). It has numerous machine learning applications, including dimensionality reduction (Feng et al., 2013; Wei et al., 2018), classification (Raina et al., 2007; Mairal et al., 2008), and clustering (Ramirez et al., 2010; Sprechmann and Sapiro, 2010), to name a few. Although DL has received significant attention, it mostly focuses on vectorized data of the same dimension and is not amenable to graph data (Xu, 2020; Vincent-Cuaz et al., 2021; 2022). Many exciting machine learning tasks use graphs to capture complex structures (Backstrom and Leskovec, 2011; Sadreazami et al., 2017; Naderializadeh et al., 2020; Jin et al., 2017; Agrawal et al., 2018). DL for graphs is more challenging due to the lack of effective means to compare graphs. Specifically, evaluating the similarity between one observed graph and its approximation is difficult, since the two graphs often have different numbers of nodes and the node correspondence across graphs is often unknown (Xu, 2020; Vincent-Cuaz et al., 2021). The seminal work of Vincent-Cuaz et al. (2021) proposed a DL method for graphs based on the Gromov-Wasserstein Discrepancy (GWD), a variant of the Gromov-Wasserstein distance. The Gromov-Wasserstein distance compares probability distributions supported on different metric spaces using pairwise distances (Mémoli, 2011). By expressing each graph as a probability measure and capturing the graph topology with a pairwise relation matrix (PRM), comparing graphs can be naturally formulated as computing the GWD, since both the node correspondence and the discrepancy of the compared graphs are calculated (Peyré et al., 2016; Xu et al., 2019b).
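To make the PRM construction concrete, the following sketch (our illustration, not the authors' code) builds one common choice of PRM, the shortest-path matrix, from a binary adjacency matrix and equips the graph with uniform node weights so that it can be treated as a discrete probability measure. The helper name `shortest_path_prm` is hypothetical.

```python
import numpy as np

def shortest_path_prm(A):
    """Pairwise relation matrix from a binary adjacency matrix
    via the Floyd-Warshall all-pairs shortest-path recursion."""
    n = A.shape[0]
    D = np.where(A > 0, 1.0, np.inf)   # edge -> hop cost 1, non-edge -> inf
    np.fill_diagonal(D, 0.0)
    for k in range(n):
        # relax all pairs through intermediate node k (broadcasted)
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

# a 4-node path graph 0-1-2-3
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

C = shortest_path_prm(A)   # PRM: hop distances between nodes
p = np.full(4, 1 / 4)      # uniform node weights -> probability measure
print(C[0, 3])             # 3.0: three hops from node 0 to node 3
```

Spurious or missing edges in `A` directly perturb every shortest path that uses them, which is precisely the kind of PRM inaccuracy discussed next.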
However, observed graphs often contain structural noise, including spurious or missing edges, which leads to differences between the obtained PRMs and the true ones (Donnat et al., 2018; Xu et al., 2019b). Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz et al., 2022; Tran et al., 2022), the inaccuracies of PRMs may severely affect GWD and the effectiveness of DL in real-world applications.

Contributions. To handle the inaccuracies of PRMs, this paper first proposes a novel robust Gromov-Wasserstein discrepancy (RGWD), which adopts a minimax formulation. We prove that the inner maximization problem has a closed-form solution and derive an efficient numerical scheme to approximate RGWD. Under suitable assumptions, such a numerical scheme is guaranteed to find a $\delta$-stationary solution within $O(1/\delta^2)$ iterations. We further prove that RGWD is lower bounded and that the lower bound is achieved if and only if two graphs are isomorphic. Therefore, RGWD can be employed to compare graphs. RGWD also satisfies the triangle inequality, which is of its own interest and allows numerous potential applications. A robust graph dictionary learning (RGDL) algorithm is thereby developed to learn atoms from noisy graph data, which assesses the quality of approximated graphs via RGWD. Numerical experiments on both synthetic and real-world datasets demonstrate that RGDL achieves good performance.

The rest of the paper is organized as follows. In Sec. 2, a comprehensive review of the background is given. Sec. 3 presents RGWD and the numerical approximation scheme for RGWD. RGDL is delineated in Sec. 4. Empirical results are demonstrated in Sec. 5. We finally discuss related work in Sec. 6.

2. BACKGROUND

2.1. OPTIMAL TRANSPORT

We first present the notation used throughout this paper and then review the definition of the Gromov-Wasserstein distance, which originates from optimal transport theory (Villani, 2008; Peyré and Cuturi, 2018).

Notation. We use bold lowercase symbols (e.g., $\mathbf{x}$), bold uppercase letters (e.g., $\mathbf{A}$), uppercase calligraphic fonts (e.g., $\mathcal{X}$), and Greek letters (e.g., $\alpha$) to denote vectors, matrices, spaces (sets), and measures, respectively. $\mathbf{1}_d \in \mathbb{R}^d$ is the $d$-dimensional all-ones vector. $\Delta_d$ is the probability simplex with $d$ bins, namely the set of probability vectors $\Delta_d = \{\mathbf{a} \in \mathbb{R}_+^d \mid \sum_{i=1}^d a_i = 1\}$. $\mathbf{A}[i,:]$ and $\mathbf{A}[:,j]$ are the $i$-th row and the $j$-th column of matrix $\mathbf{A}$, respectively. Given a matrix $\mathbf{A}$, $\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|_\infty$ denote its Frobenius norm and element-wise $\ell_\infty$-norm (i.e., $\|\mathbf{A}\|_\infty = \max_{i,j} |A_{ij}|$), respectively. The cardinality of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. The bracketed notation $[n]$ is shorthand for the integer set $\{1, 2, \ldots, n\}$. A discrete measure $\alpha$ is denoted by $\alpha = \sum_{i=1}^m a_i \delta_{x_i}$, where $\delta_x$ is the Dirac measure at position $x$, i.e., a unit of mass infinitely concentrated at $x$.

Gromov-Wasserstein distance. Optimal transport addresses the problem of transporting one probability measure towards another probability measure with the minimum cost (Villani, 2008; Peyré and Cuturi, 2018). The induced cost defines a distance between the two probability measures. The Gromov-Wasserstein (GW) distance extends classic optimal transport to compare probability measures supported on different spaces (Mémoli, 2011). Let $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ be two metric spaces. Given two probability measures $\alpha = \sum_{i=1}^m p_i \delta_{x_i}$ and $\beta = \sum_{i'=1}^n q_{i'} \delta_{y_{i'}}$, where $x_1, x_2, \ldots, x_m \in \mathcal{X}$ and $y_1, y_2, \ldots, y_n \in \mathcal{Y}$, the $r$-GW distance between $\alpha$ and $\beta$ is defined as
$$\mathrm{GW}_r(\alpha, \beta) = \Big( \min_{\mathbf{T} \in \Pi(\mathbf{p}, \mathbf{q})} \sum_{i,j=1}^{m} \sum_{i',j'=1}^{n} \big(D_{ii'jj'}\big)^r \, T_{ii'} T_{jj'} \Big)^{1/r},$$

where the feasible domain of the transport plan $\mathbf{T} = [T_{ii'}]$ is given by the set $\Pi(\mathbf{p}, \mathbf{q}) = \{\mathbf{T} \in \mathbb{R}_+^{m \times n} \mid \mathbf{T}\mathbf{1}_n = \mathbf{p},\ \mathbf{T}^\top \mathbf{1}_m = \mathbf{q}\}$, and $D_{ii'jj'}$ calculates the difference between pairwise distances, i.e., $D_{ii'jj'} = |d_{\mathcal{X}}(x_i, x_j) - d_{\mathcal{Y}}(y_{i'}, y_{j'})|$.
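For small spaces, the quadratic GW objective can be evaluated directly by summing over all index quadruples. The NumPy sketch below (our illustration; `gw_objective` is a hypothetical helper, and we fix $r=2$) shows that for two isomorphic 3-point spaces, the coupling that encodes the true node correspondence drives the objective to zero, while a mismatched (identity) coupling does not.

```python
import numpy as np
from itertools import product

def gw_objective(Dx, Dy, T, r=2):
    """Evaluate sum_{i,j,i',j'} |Dx[i,j] - Dy[i',j']|^r * T[i,i'] * T[j,j']
    for a fixed transport plan T (brute force; fine for tiny spaces)."""
    m, n = T.shape
    J = 0.0
    for i, j, ip, jp in product(range(m), range(m), range(n), range(n)):
        J += abs(Dx[i, j] - Dy[ip, jp]) ** r * T[i, ip] * T[j, jp]
    return J

# two isomorphic 3-point metric spaces related by swapping points 0 and 1
Dx = np.array([[0., 1., 2.],
               [1., 0., 3.],
               [2., 3., 0.]])
P = np.eye(3)[[1, 0, 2]]   # permutation matrix swapping 0 and 1
Dy = P @ Dx @ P.T          # relabeled copy of Dx

T_good = P / 3.0           # coupling encoding the true correspondence
T_id = np.eye(3) / 3.0     # identity coupling ignores the relabeling
print(gw_objective(Dx, Dy, T_good))  # 0.0: correct matching found
print(gw_objective(Dx, Dy, T_id))    # positive: wrong matching is penalized
```

Minimizing this objective over all of $\Pi(\mathbf{p}, \mathbf{q})$ is a nonconvex quadratic program; in practice, solvers such as conditional gradient or entropic schemes are used rather than the brute-force evaluation shown here.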

