LEARNING LATENT TOPOLOGY FOR GRAPH MATCHING

Abstract

Graph matching (GM) has traditionally been modeled as a deterministic optimization problem characterized by an affinity matrix under a pre-defined graph topology. Though there have been several attempts at learning more effective node-level affinities/representations for matching, they still heavily rely on the initial graph structure/topology, which is typically obtained in heuristic ways (e.g. Delaunay triangulation or k-nearest neighbors) and is not adjusted during the learning process to adapt to problem-specific patterns. We argue that such learning on a fixed topology may restrict the potential of a GM solver for specific tasks, and propose to learn latent graph topology in place of the fixed topology as input. To this end, we devise two types of latent graph generation procedures, in a deterministic and a generative fashion, respectively. In particular, the generative procedure emphasizes across-graph consistency and thus can be viewed as a matching-guided generative model. Our methods show superior performance over previous state-of-the-art methods on public benchmarks.

1. INTRODUCTION

Being a long-standing NP-hard problem (Loiola et al., 2007), graph matching (GM) has received persistent attention from the machine learning and optimization communities for many years. Concretely, for two graphs with n nodes each, graph matching seeks to solve:

max_z z^T M z,  s.t.  Hz = 1,  z ∈ {0, 1}^{n^2}    (1)

where the affinity matrix M ∈ R_+^{n^2×n^2} encodes node (diagonal elements) and edge (off-diagonal) affinities/similarities and z is the column-wise vectorization of the permutation matrix Z. H is a selection matrix ensuring that each row and column of Z sums to 1, and 1 is a column vector filled with 1. Eq. (1) is the so-called quadratic assignment problem (QAP) (Cho et al., 2010). Maximizing Eq. (1) amounts to maximizing the total similarity induced by the matching Z. While Eq. (1) does not encode the topology of the graphs, Zhou & Torre (2016) further propose to factorize M to explicitly incorporate a topology matrix, where a connectivity matrix A ∈ {0, 1}^{n×n} indicates the topology of a single graph (A_ij = 1 if there exists an edge between nodes i and j; A_ij = 0 otherwise). To ease the computation, Eq. (1) is typically relaxed by letting z ∈ [0, 1]^{n^2} and keeping the other parts of Eq. (1) intact. Traditional solvers for such a relaxed problem generally fall into the categories of iterative update (Cho et al., 2010; Jiang et al., 2017) or numerical continuation (Zhou & Torre, 2016; Yu et al., 2018), and are developed under two key assumptions: 1) the affinity M is pre-computed with some non-negative metric, e.g. a Gaussian kernel, L2 distance or Manhattan distance; 2) the graph topology is pre-defined as input, either in a dense (Schellewald & Schnörr, 2005) or sparse (Zhou & Torre, 2016) fashion. There have been several successful attempts at relaxing the first assumption by leveraging the power of deep networks to learn more effective graph representations for GM (Wang et al., 2019a; Yu et al., 2020; Fey et al., 2020).
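To make Eq. (1) concrete, the following is a minimal NumPy sketch of the QAP objective on a toy instance; the affinity matrix here is a random placeholder, not computed from real features.

```python
import numpy as np
from itertools import permutations

# Toy instance of Eq. (1): two graphs with n nodes each. M is the
# n^2 x n^2 affinity matrix; the values are random placeholders.
n = 3
rng = np.random.default_rng(0)
M = rng.random((n * n, n * n))
M = (M + M.T) / 2  # affinities are symmetric and non-negative

def qap_score(Z, M):
    """z^T M z, with z the column-wise vectorization of permutation Z."""
    z = Z.flatten(order="F")
    return float(z @ M @ z)

# Brute force over all n! permutation matrices. This is only viable for
# tiny n -- the NP-hardness of Eq. (1) is why learned solvers are needed.
candidates = [np.eye(n)[list(p)] for p in permutations(range(n))]
best_Z = max(candidates, key=lambda Z: qap_score(Z, M))
```

Each candidate Z is a valid permutation matrix (rows and columns sum to 1), matching the constraint expressed through the selection matrix H.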
However, to the best of our knowledge, there is little previous work questioning and addressing the second assumption in the context of learning-based graph matching. For example, the existing standard pipeline for keypoint matching in computer vision constructs an initial topology by Delaunay triangulation or k-nearest neighbors. This topology is then frozen throughout the subsequent learning and matching procedures. In this sense, the construction of graph topology is decoupled from the matching task as a pre-processing stage. More examples can be found beyond the vision community, e.g. in social network alignment (Zhang & Tong, 2016; Heimann et al., 2018; Xiong & Yan, 2020), which assumes a fixed network structure for individual node matching across two networks. We argue that freezing the graph topology for matching can hinder the capacity of graph matching solvers. For a pre-defined graph topology, the linked nodes sometimes result in less meaningful interaction, especially under the message-passing mechanism of graph neural networks (Kipf & Welling, 2017). We give a schematic demonstration in Fig. 1. Though some earlier attempts (Cho & Lee, 2012; Cho et al., 2013) seek to adjust the graph topology in a traditional non-deep-learning setting, such procedures cannot readily be integrated into end-to-end deep learning frameworks due to their non-differentiable nature. Building upon the hypothesis that there exists some latent topology better suited for GM than the heuristically created one, our aim is to learn it (or its distribution). Indeed, jointly solving matching and graph topology learning can be intimidating due to the combinatorial nature of both tasks, which calls for more advanced approaches. In this paper, we propose an end-to-end framework to jointly learn the latent graph topology and perform GM, termed deep latent graph matching (DLGM).
We leverage the power of graph generative model to automatically produce graph topology from given features and their geometric relations, under specific locality prior. Different from generative learning on singleton graphs (Kipf & Welling, 2016; Bojchevski et al., 2018) , our graph generative learning is performed in a pairwise fashion, leading to a novel matching-guided generative paradigm. The source code will be made publicly available. Contributions: 1) We explore a new direction for more flexible GM by actively learning latent topology, in contrast to previous works using fixed topology as input; 2) Under this setting, we propose a deterministic optimization approach to learn graph topology for matching; 3) We further present a generative way to produce latent topology under a probabilistic interpretation by Expectation-Maximization. This framework can also adapt to other problems where graph topology is the latent structure to infer; 4) Our method achieves state-ofthe-art performance on public benchmarks.

2. RELATED WORKS

In this section, we first discuss existing works on joint graph topology updating and matching, whose motivation is similar to ours while the techniques differ substantially. We then discuss relevant works on learning graph matching and on generative graph models from a technical perspective.

Topology updating and matching. There are a few works on joint graph topology updating and matching in the context of network alignment. Specifically, given two initial networks for matching, Du et al. (2019) show how to alternately perform link prediction within each network and node matching across networks, based on the observation that the two tasks can benefit each other. In their extension (Du et al., 2020), a skip-gram embedding framework is further established under the same problem setting. These works involve a random-walk-based node embedding update and a classification-based link prediction module, and the whole algorithm runs in a one-shot optimization fashion: there is neither an explicit training dataset nor a trained matching model (except for the link classifier), so they bear less resemblance to a machine learning pipeline. In contrast, our method trains an explicit model for topology recovery and matching. Specifically, our deterministic technique (see Sec. 3.4.1) solves graph topology and matching in one shot, while the proposed generative method alternately estimates the topology and the matching (see Sec. 3.4.2). Our approach can fully leverage multiple training samples, as in many applications such as computer vision, to boost performance on the test set. Moreover, the combinatorial nature of the matching problem is not addressed in (Du et al., 2019; 2020), which adopt a greedy selection strategy instead, whereas we develop a principled combinatorial learning approach to this challenge.
Their methods also rely on a considerable number of seed matchings, while this paper learns the latent topology directly from scratch, which is more challenging and seldom studied.

Learning of graph matching. Early non-deep-learning methods seek to learn an effective metric (e.g. a weighted Euclidean distance) for node and edge features, or an affinity kernel (e.g. a Gaussian kernel), in a parametric fashion (Caetano et al., 2009; Cho et al., 2013). More recent deep approaches learn node/edge representations for matching, where graph embedding (Wang et al., 2019a; Yu et al., 2020; Fey et al., 2020) and geometric learning (Zhang & Lee, 2019; Fey et al., 2020) are involved. Rolínek et al. (2020) study the incorporation of traditional non-differentiable combinatorial solvers by introducing a differentiable blackbox GM solver (Pogancic et al., 2020). Recent works tackling combinatorial problems with deep learning (Huang et al., 2019; Kool & Welling, 2018) also inspire the development of combinatorial deep solvers, for GM problems formulated both as Koopmans-Beckmann's QAP (Nowak et al., 2018; Wang et al., 2019a) and as Lawler's QAP (Wang et al., 2019b).

Generative graph models. Graph generative models have been studied in recent years (Kipf & Welling, 2016; Wang et al., 2018; Bojchevski et al., 2018). Specifically, Wang et al. (2018) and Bojchevski et al. (2018) seek to unify graph generative models and generative adversarial networks. In parallel, reinforcement learning has been adopted to generate discrete graphs (De Cao & Kipf, 2018).

3. LEARNING LATENT TOPOLOGY FOR GM

In this section, we describe the details of the proposed framework, with two specific algorithms derived from deterministic and generative perspectives, respectively. Both algorithms are motivated by the hypothesis that there exists some latent topology more suitable for matching than a fixed one. Note that the proposed deterministic algorithm performs a standard forward-backward pass to jointly learn the topology and the matching, while our generative algorithm consists of an alternating optimization procedure between estimating the latent topology and learning the matching, under an Expectation-Maximization (EM) interpretation. In general, the generative algorithm assumes that a latent topology is sampled from a latent distribution, and seeks the distribution under which the expected matching accuracy is maximized; we therefore aim to learn a topology generator realizing such a distribution. We reformulate GM in a Bayesian fashion for consistent discussion in Sec. 3.1, detail the deterministic/generative latent modules in Sec. 3.2, and discuss the loss functions from a probabilistic perspective in Sec. 3.3. We finally elaborate on the holistic framework and the optimization procedures for both algorithms (deterministic and generative) in Sec. 3.4.

3.1. PROBLEM DEFINITION AND BACKGROUND

The GM problem can be viewed through a Bayesian variant of Eq. (1). In general, let G^(s) and G^(t) represent the initial source and target graphs for matching, respectively. We represent a graph as G := {X, E, A}, where X ∈ R^{n×d_1} is the representation of n nodes with dimension d_1, E ∈ R^{m×d_2} are the features of m edges, and A ∈ {0, 1}^{n×n} is the initial connectivity (i.e. topology) matrix obtained by heuristics, e.g. Delaunay triangulation. For notational brevity, we assume d_1 and d_2 stay intact after updating the features across the convolutional layers of a GNN (i.e., the feature dimensions of both nodes and edges do not change after each layer's update). Denote by Z ∈ {0, 1}^{n×n} the matching between two graphs, where Z_ij = 1 indicates that a correspondence exists between node i in G^(s) and node j in G^(t), and Z_ij = 0 otherwise. Given training samples {Z_k, G_k^(s), G_k^(t)} with k = 1, 2, ..., N, learning-based GM aims to maximize the likelihood:

max_θ ∏_k P_θ(Z_k | G_k^(s), G_k^(t))    (2)

where θ denotes the model parameters and P_θ(·) measures the probability of matching Z_k given the k-th pair, instantiated via a network parameterized by θ. Being a generic module for producing latent topology, our method can be flexibly integrated into existing deep GM frameworks. We build our method upon the state-of-the-art (Rolínek et al., 2020), which utilizes SplineCNN (Fey et al., 2018) for node/edge representation learning. SplineCNN is a specific graph neural network which updates a node representation via a weighted summation over its neighbors. The update rule at node i of a standard SplineCNN reads:

(x ∗ g)(i) = 1/|N(i)| ∑_{l=1}^{d_1} ∑_{j∈N(i)} x_l(j) · g_l(e(i, j))    (3)

where x_l(j) is the l-th feature value of node j and g_l(·) delivers the message weight given the edge feature e(i, j). N(i) refers to i's neighboring nodes; the summation over neighbors follows the topology A.
Since our algorithm learns to generate topology, we need to express Eq. (3) in a way that is explicitly differentiable w.r.t. A. To this end, we rewrite Eq. (3) as:

(x ∗ g | A) = (Â ⊙ G) X    (4)

where Â is the normalized connectivity, with each row divided by the degree |N(i)| (see Eq. (3)) of the corresponding node i. G and X collect the outputs of the g_l(·) and x_l(·) operators, respectively, and ⊙ is the Hadamard (element-wise) product. With Eq. (4), we can thus back-propagate through the connectivity/topology A. See more details in Appendix A.2.
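The tensorized form of Eq. (4) can be sketched as follows. For simplicity the edge-conditioned kernel output g_l(e(i, j)) is collapsed into a single scalar weight per edge (matrix G_w below, a random stand-in); in SplineCNN it is computed from B-spline bases and differs per feature dimension l.

```python
import numpy as np

# Sketch of the aggregation in Eq. (4) with toy values.
rng = np.random.default_rng(1)
n, d = 4, 5
X = rng.random((n, d))                      # node features
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # topology
G_w = rng.random((n, n))                    # toy message weights g(e(i, j))

deg = A.sum(axis=1, keepdims=True)          # |N(i)| per node
A_hat = A / np.clip(deg, 1, None)           # row-normalized connectivity
out = (A_hat * G_w) @ X                     # (A_hat elementwise G) X, Eq. (4)

# The same result via the explicit neighbor sum of Eq. (3):
loop = np.zeros_like(out)
for i in range(n):
    nbrs = np.flatnonzero(A[i])
    loop[i] = sum(G_w[i, j] * X[j] for j in nbrs) / len(nbrs)
```

The point of the rewrite is that `A` now appears as an explicit matrix factor in the forward computation, so a framework with automatic differentiation can propagate gradients to it.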

3.2. LATENT TOPOLOGY LEARNING

Existing learning-based graph matching algorithms consider A to be fixed throughout the computation, without questioning whether the input topology is optimal. This can be problematic since the input graph construction is heuristic and never takes into account how suitable it is for the subsequent GM task. In our framework, instead of utilizing a fixed pre-defined topology, we produce a latent topology in two settings: 1) a deterministic and 2) a generative way. The former is often more efficient, while the latter can be more accurate at the cost of exploring more latent topologies. Note that both methods produce discrete topology, to verify our hypothesis about the existence of a discrete latent topology more suitable for the GM problem. The following describes the two structures.

Deterministic learning: Given input features X and initial topology A, the deterministic way of generating the latent topology Ã ∈ {0, 1}^{n×n} is:

Ã_ij = Rounding(sigmoid(y_i^T W y_j))  with  Y = GCN(X, A)    (5)

where GCN(·) is a graph convolutional network (Kipf & Welling, 2017) and y_i is the feature of node i in the feature map Y. W is a learnable parameter matrix. The function Rounding(·) is non-differentiable and will be discussed in Sec. 3.4.1.

Generative learning: We reparameterize the node representation as:

P(y_i | X, A) = N(y_i | μ_i, diag(σ_i^2))    (6)

where μ = GCN_μ(X, A) and σ = GCN_σ(X, A) are two GCNs producing the mean and covariance. This is equivalent to sampling a random vector from an i.i.d. uniform distribution s ∼ U(0, 1) and then applying y = μ + s ⊙ σ, where ⊙ is the element-wise product. Similarly to Eq. (5), by introducing a learnable parameter W, the generative latent topology is sampled following an i.i.d. distribution over each edge (i, j):

P(Ã | Y) = ∏_i ∏_j P(Ã_ij | y_i, y_j)  with  P(Ã_ij = 1 | y_i, y_j) = sigmoid(y_i^T W y_j)    (7)

Since sigmoid(·) maps any input into (0, 1), Eq. (7) can be interpreted as the probability of sampling edge (i, j).
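A minimal sketch of the deterministic generator in Eq. (5) follows. The GCN is replaced by a single untrained propagation step as a stand-in for GCN(X, A); `W`, `W_gcn` and all features are random toy values, not trained parameters.

```python
import numpy as np

# Toy sketch of Eq. (5): bilinear edge scores on GCN-like features,
# then rounding to a discrete latent topology.
rng = np.random.default_rng(2)
n, d = 5, 8
X = rng.random((n, d))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # symmetric initial topology

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One GCN-like step: Y = D^{-1} (A + I) X W_gcn  (stand-in for GCN(X, A))
A_self = A + np.eye(n)
D_inv = 1.0 / A_self.sum(axis=1, keepdims=True)
W_gcn = rng.standard_normal((d, d))
Y = (D_inv * A_self) @ X @ W_gcn

W = rng.standard_normal((d, d))              # learnable bilinear parameter
probs = sigmoid(Y @ W @ Y.T)                 # P(edge i-j) before rounding
A_lat = np.rint(probs)                       # Rounding(.): discrete {0, 1}
```

In training, the non-differentiable rounding step is handled with a straight-through estimator, as discussed in Sec. 3.4.1.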
As the sampling procedure is non-differentiable, we apply the Gumbel-softmax trick (Jang et al., 2017) as another reparameterization. As such, a latent graph topology Ã can be sampled fully from the distribution P(Ã) while the procedure remains differentiable.
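The Gumbel-softmax relaxation for one edge can be sketched as follows; `a_ij` holds the two event probabilities [P(edge), 1 - P(edge)] produced by the generator, and a small temperature pushes the relaxed sample toward a one-hot vector.

```python
import numpy as np

# Gumbel-softmax sampling of a single edge indicator (cf. Sec. 3.4.2).
def gumbel_softmax(a, tau, rng):
    u = rng.random(a.shape)
    h = -np.log(-np.log(u))                 # Gumbel(0, 1) samples
    logits = (np.log(a) + h) / tau
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
a_ij = np.array([0.7, 0.3])                 # toy edge probability 0.7
sample = gumbel_softmax(a_ij, tau=0.1, rng=rng)
# sample[0] is the (nearly discrete) indicator that edge (i, j) exists
```

By the Gumbel-max property, the winning entry equals the true categorical sample regardless of the temperature, so over many draws the edge is selected with probability close to 0.7 while every draw remains differentiable w.r.t. `a_ij`.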

3.3. LOSS FUNCTIONS

In this section, we explain the three loss functions and the motivation behind them: the matching loss, the locality loss, and the consistency loss. The corresponding probabilistic interpretation of each loss can be found in Sec. 3.4.2. These losses are selectively activated in DLGM-D and DLGM-G (see Sec. 3.4); in DLGM-G, different losses are activated in the inference and learning steps.

i) Matching loss. This common term measures how the predicted matching Ẑ diverges from the ground truth Z. Following Rolínek et al. (2020), we adopt the Hamming distance on node-wise matchings:

L_M = Hamming(Ẑ, Z)    (8)

ii) Locality loss. This loss accounts for the general prior that the produced/learnt graph topology should favor local connections over distant ones, since two nodes may have less meaningful interaction once they are too distant from each other. In this sense, the locality loss serves as a prior or regularizer in GM. As shown in multiple GM methods (Yu et al., 2018; Wang et al., 2019a; Fey et al., 2020), Delaunay triangulation is an effective way to deliver good locality. Therefore, our locality loss is the Hamming distance between the initial topology A (obtained from Delaunay triangulation) and the predicted topology Ã, for both the source and target graphs:

L_L = Hamming(A^(s), Ã^(s)) + Hamming(A^(t), Ã^(t))    (9)

We emphasize that the locality loss serves as a prior for the latent graph: it advocates locality, rather than reconstructing the initial Delaunay triangulation (as in graph VAE (Kipf & Welling, 2016)).

iii) Consistency loss. One can imagine that a GM solver is likely to deliver better performance if the two graphs in a training pair are similar. In particular, we anticipate the latent topologies Ã^(s) and Ã^(t) to be isomorphic under a specific matching, since isomorphic topological structures tend to be easier to match.
Driven by this consideration, we devise the consistency loss, which measures the level of isomorphism between the latent topologies Ã^(s) and Ã^(t):

L_C(·|Z) = |Z^T Ã^(s) Z - Ã^(t)| + |Z Ã^(t) Z^T - Ã^(s)|    (10)

Note that Z does not necessarily refer to the ground truth, but can be any predicted matching. In this sense, the latent topologies Ã^(s) and Ã^(t) can be generated jointly given the matching Z as guidance. This term can also serve as a consistency prior or regularizer. We give a schematic example showing the merit of the consistency loss in Fig. 2(b).
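The consistency loss of Eq. (10) can be sketched on toy adjacency matrices as follows; when the two latent topologies are isomorphic under the permutation Z, the loss is exactly zero.

```python
import numpy as np

# Consistency loss of Eq. (10): entry-wise L1 mismatch between the
# relabeled source topology and the target topology (and vice versa).
def consistency_loss(A_s, A_t, Z):
    """L_C = |Z^T A_s Z - A_t|_1 + |Z A_t Z^T - A_s|_1."""
    return (np.abs(Z.T @ A_s @ Z - A_t).sum()
            + np.abs(Z @ A_t @ Z.T - A_s).sum())

A_s = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)   # a 3-node path graph
Z = np.eye(3)[[2, 0, 1]]                   # a toy matching (permutation)
A_t = Z.T @ A_s @ Z                        # target = source relabeled by Z
```

Here `consistency_loss(A_s, A_t, Z)` is zero for this isomorphic pair, and each disagreeing edge adds to the loss.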

3.4. FRAMEWORK

A schematic diagram of our framework is given in Fig. 2(a), which shows the singleton pipeline for processing a single image. It consists of three essential modules: a feature backbone (N_B), a latent topology module (N_G) and a feature refinement module (N_R). Specifically, module N_G corresponds to Sec. 3.2 with deterministic or generative implementations. Note that the geometric relations of the keypoints provide a prior for generating the topology Ã. [Fig. 2(b): the initial topologies A^(s) and A^(t) are constructed using Delaunay triangulation. Given the matching Z as guidance, latent topologies Ã^(s) and Ã^(t) are generated from inputs A^(s) and A^(t), respectively. The learned topologies are isomorphic (L_C = 0) w.r.t. Z and thus easier to match at test time, compared to the non-isomorphic input structures (L_C = 4).] We employ VGG16 (Simonyan & Zisserman, 2014) as N_B and feed the produced node features X and edge features E to N_G. N_B also produces a global feature for each image. After generating the latent topology Ã, we pass X and E together with Ã to N_R (SplineCNN (Fey et al., 2018)). The holistic pipeline handling pairwise graph inputs is shown in Fig. 4 in Appendix A.1; it consists of two copies of the singleton pipeline processing the source and target data (in a Siamese fashion), respectively. The outputs of the two singleton pipelines are then assembled into an affinity matrix, followed by a differentiable blackbox GM solver (Pogancic et al., 2020) with a message-passing mechanism (Swoboda et al., 2017). Note that without N_G, the holistic pipeline with only N_B + N_R is identical to the method in Rolínek et al. (2020); readers are referred to this strong baseline for the shared algorithmic details.

3.4.1. OPTIMIZATION WITH DETERMINISTIC LATENT GRAPH

We now show how to optimize with the deterministic latent graph module, where the topology Ã is produced by Eq. (5). The objective of matching conditioned on the produced latent topology becomes:

max_θ ∏_k P_θ(Z_k | Ã_k^(s), Ã_k^(t), G_k^(s), G_k^(t))    (11)

Eq. (11) can be optimized by standard back-propagation with all three loss terms activated, except that the Rounding function (see Eq. (5)) makes the procedure non-differentiable. To address this, we use the straight-through estimator (Bengio et al., 2013), which performs standard rounding during the forward pass but approximates it with the gradient of the identity during the backward pass on [0, 1]:

∂Rounding(x)/∂x = 1    (12)

Though there exist unbiased gradient estimators (e.g. REINFORCE (Williams, 1992)), the biased straight-through estimator has proved more efficient and has been successfully applied in several applications (Chung et al., 2017; Campos et al., 2018). All network modules (N_G + N_B + N_R) are learned simultaneously during training. All three losses are activated in the learning procedure (see Sec. 3.3), applied to the predicted matching Ẑ and the latent topologies Ã^(s) and Ã^(t). We term the algorithm under this setting DLGM-D.
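The straight-through behavior of Eq. (12) can be sketched as two explicit passes: the forward pass rounds edge probabilities to {0, 1}, while the backward pass pretends the operation was the identity and passes the incoming gradient through unchanged.

```python
import numpy as np

# Straight-through estimator for Rounding(.), written as explicit
# forward/backward functions for illustration.
def ste_round_forward(p):
    return np.rint(p)

def ste_round_backward(grad_out):
    # d Rounding(x) / d x is approximated by 1 on [0, 1]  (Eq. (12))
    return grad_out

p = np.array([0.2, 0.51, 0.8])
hard = ste_round_forward(p)                   # discrete topology entries
grad = ste_round_backward(np.array([0.3, -0.1, 0.5]))
```

In an autodiff framework this is commonly written as `hard = p + stop_gradient(round(p) - p)`, which realizes the same forward value and identity gradient in a single expression.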

3.4.2. OPTIMIZATION WITH GENERATIVE LATENT GRAPH

In this setting, the source and target latent topologies Ã^(s) and Ã^(t) are sampled according to Eqs. (6) and (7); see Appendix A.3 for more details. The objective becomes:

max_θ ∏_k ∑_{Ã_k^(s), Ã_k^(t)} P_θ(Z_k, Ã_k^(s), Ã_k^(t) | G_k^(s), G_k^(t))    (13)

Unfortunately, directly optimizing Eq. (13) is difficult due to the intractable summation over the latent topologies. Instead, we maximize the evidence lower bound (ELBO) (Bishop, 2006):

log P_θ(Z | G^(s), G^(t)) ≥ E_{Q_φ(Ã^(s), Ã^(t) | G^(s), G^(t))} [log P_θ(Z, Ã^(s), Ã^(t) | G^(s), G^(t)) - log Q_φ(Ã^(s), Ã^(t) | G^(s), G^(t))]    (14)

where Q_φ(Ã^(s), Ã^(t) | G^(s), G^(t)) can be any joint distribution of Ã^(s) and Ã^(t) given the input graphs G^(s) and G^(t). Equality in Eq. (14) holds when Q_φ(Ã^(s), Ã^(t) | G^(s), G^(t)) = P_θ(Ã^(s), Ã^(t) | Z, G^(s), G^(t)). For tractability, we introduce independence by assuming that an identical latent topology module Q_φ (corresponding to N_G in Fig. 2(a)) handles each input graph separately:

Q_φ(Ã^(s), Ã^(t) | G^(s), G^(t)) = Q_φ(Ã^(s) | G^(s)) Q_φ(Ã^(t) | G^(t))    (15)

which greatly reduces the model complexity. We can then utilize a neural network to model Q_φ (similar to modeling P_θ). The optimization of Eq. (14) is studied in (Neal & Hinton, 1998) and known as the Expectation-Maximization (EM) algorithm; it alternates between an E-step and an M-step. During the E-step (inference), P_θ is fixed and the algorithm seeks an optimal Q_φ to approximate the true posterior distribution (see Appendix A.3 for an explanation):

P_θ(Ã^(s), Ã^(t) | Z, G^(s), G^(t))    (16)

During the M-step (learning), Q_φ is instead fixed and the algorithm maximizes the expected likelihood:

E_{Q_φ(Ã^(s) | G^(s)), Q_φ(Ã^(t) | G^(t))} [log P_θ(Z, Ã^(s), Ã^(t) | G^(s), G^(t))] ∝ -L_M    (17)

We give more details on the inference and learning steps as follows. Inference.
This step derives the posterior distribution P_θ(Ã^(s), Ã^(t) | Z, G^(s), G^(t)) via its approximation Q_φ. To this end, we fix the parameters θ in modules N_B and N_R, and only update the parameters φ in module N_G corresponding to Q_φ. As stated in Sec. 3.2, we employ the Gumbel-softmax trick to sample discrete Ã (Jang et al., 2017). We formulate a 2D vector a_ij = [P(Ã_ij = 1), 1 - P(Ã_ij = 1)]^T; the sampling then becomes:

softmax((log(a_ij) + h_ij) / τ)    (18)

where h_ij is a random 2D vector drawn from the Gumbel distribution and τ is a small temperature parameter. We further impose a prior on the latent topology Ã given A through the locality loss:

log ∏_{i,j} P(Ã_ij | A_ij) ∝ -L_L(A, Ã)    (19)

which preserves the locality of the initial topology A. It should also be noted that Z here is the predicted matching from the current P_θ, as Q_φ is an approximation. Besides, we anticipate that the two generated topologies Ã^(s) and Ã^(t) of a graph pair should be similar (isomorphic) given the matching Z:

log P(Ã^(s), Ã^(t) | Z) ∝ -L_C(Ã^(s), Ã^(t) | Z)    (20)

In summary, we activate the locality loss and the consistency loss during the inference step, where the latter is conditioned on the predicted matching rather than the ground truth. Note that the inference step involves two reparameterization tricks, corresponding to Eqs. (6) and (18), respectively: the first generates the continuous topology distribution under the edge-independence assumption, while the second performs discrete sampling from the generated distribution.

Learning. This step optimizes P_θ while fixing Q_φ. We sample discrete graph topologies Ã entirely from the edge probabilities P(Ã_ij = 1). Once the latent topologies are sampled, we feed them to module N_R together with the node-level features from N_B. Only N_B and N_R are updated in this step, and only the matching loss L_M is activated. Remark.
Note that for each pair of graphs in training, we use an identical random vector s for generating both graphs' topologies (see Eq. (6)). We pretrain the network P_θ before alternately training P_θ and Q_φ. During pretraining, we activate the N_B + N_R modules and the L_M loss, and feed the network the Delaunay triangulation as the latent topology. After pretraining, the optimization switches between inference and learning steps until convergence. We term the setting of generative latent graph matching DLGM-G and summarize it in Alg. 1.

Algorithm 1: Deep latent graph matching with generative latent graph (DLGM-G)
1: Input: G^(s), G^(t) and ground-truth Z; Output: matching Ẑ;
2: Pretrain P_θ using Eq. (11), given Delaunay triangulation as input topology;
3: while not converged do
4:   # Inference (E-step):
5:   Obtain predicted matching Ẑ using fixed P_θ;
6:   Update Q_φ (i.e. N_G) with loss L_L + L_C(·|Ẑ) according to Eq. (16);
7:   # Learning (M-step):
8:   Obtain predicted graph topologies Ã^(s) and Ã^(t) using Q_φ;
9:   Update P_θ (i.e. N_B and N_R) with loss L_M given Ã^(s) and Ã^(t) according to Eq. (17);
10: end while
11: Predict the topology and the matching Ẑ with the whole network activated (i.e. N_G + N_B + N_R);
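The alternation of Alg. 1 can be skeletonized as follows. Every function below is a hypothetical stub standing in for the real modules: P_θ covers N_B + N_R, Q_φ covers N_G, and the updates would use the losses of Sec. 3.3.

```python
# Skeleton of the DLGM-G alternation in Alg. 1 (stubs only).
def pretrain(p_theta):
    p_theta["ready"] = True        # stub: train N_B + N_R with L_M on Delaunay topology

def e_step(p_theta, q_phi):
    z_hat = "matching from fixed P_theta"        # stub: forward pass only
    q_phi["updates"] += 1          # stub: minimize L_L + L_C(.|z_hat)
    return z_hat

def m_step(p_theta, q_phi):
    topo = "latent topology sampled from Q_phi"  # stub
    p_theta["updates"] += 1        # stub: minimize L_M given sampled topology
    return topo

p_theta = {"ready": False, "updates": 0}
q_phi = {"updates": 0}
pretrain(p_theta)
for epoch in range(3):             # stands in for "while not converged"
    e_step(p_theta, q_phi)         # E-step: infer matching, update Q_phi
    m_step(p_theta, q_phi)         # M-step: sample topology, update P_theta
```

The key structural point is that each phase freezes one set of parameters while updating the other, mirroring the fixed-P_θ E-step and fixed-Q_φ M-step of the EM interpretation.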

4. EXPERIMENT

We conduct experiments on datasets including Pascal VOC with Berkeley annotations (Everingham et al., 2010; Bourdev & Malik, 2009), Willow ObjectClass (Cho et al., 2013) and SPair-71K (Min et al., 2019). We report the per-category and average performance. The objective of all experiments is to maximize the average matching accuracy. Both DLGM-D and DLGM-G are tested.

Peer methods. We compare against the following algorithms: 1) GMN (Zanfir & Sminchisescu, 2018), a seminal work incorporating graph matching into a deep learning framework equipped with a spectral solver (Egozi et al., 2013); 2) PCA (Wang et al., 2019a), which treats graph matching as a feature matching problem and employs a GCN (Kipf & Welling, 2017) to learn better features; 3) CIE1/GAT-H (Yu et al., 2020), which develops a novel embedding and attention mechanism, where GAT-H is the version replacing the basic embedding block with Graph Attention Networks (Veličković et al., 2018); 4) DGMC (Fey et al., 2020), which devises a post-processing step emphasizing neighborhood similarity; 5) BBGM (Rolínek et al., 2020), which integrates a differentiable combinatorial solver (Pogancic et al., 2020) into a deep learning framework and achieves state-of-the-art performance.

Results on Pascal VOC. The dataset (Everingham et al., 2010; Bourdev & Malik, 2009) consists of 7,020 training images and 1,682 testing images with 20 classes in total, together with an object bounding box for each. Following the data preparation in (Wang et al., 2019a), each object within the bounding box is cropped and resized to 256 × 256. The number of nodes per graph ranges from 6 to 23.
We further follow (Rolínek et al., 2020) and adopt two evaluation metrics: 1) Accuracy, the standard metric evaluated on the keypoints after filtering out outliers; 2) F1-score, evaluated without keypoint filtering, being the harmonic mean of precision and recall. Experimental results for the two settings are shown in Tab. 1 and Tab. 2. The proposed method, under either the DLGM-D or the DLGM-G setting, outperforms its counterparts in both accuracy and F1-score, and DLGM-G generally outperforms DLGM-D. Discussion can be found in Appendix A.5.

Quality of generated topology. We further show the consistency/locality loss curves versus epochs in Fig. 3, since both losses can somewhat reflect the quality of the topology generation. Both the locality and consistency losses (Eqs. (9) and (10)) keep decreasing during training, showing the effectiveness of adaptive topology learning for matching. Note that the consistency loss with Delaunay triangulation (green dashed line) is far larger than with our generated topologies (blue/red dashed lines). This clearly supports the claim that our method generates similar (more isomorphic) topologies while preserving locality.

Results on Willow Object. The benchmark (Cho et al., 2013) consists of 256 images in 5 categories, where two categories (car and motorbike) are subsets selected from Pascal VOC. Following the preparation protocol in Wang et al. (2019a), we crop the image within the object bounding box and resize it to 256 × 256. Since the dataset is relatively small, we conduct experiments to verify the transfer ability of different methods under two settings: 1) trained on Pascal VOC and directly applied to Willow (Pt); 2) trained on Pascal VOC and then finetuned on Willow (Wt). Results under the two settings are shown in Tab. 3. Since this dataset is relatively small, further improvement is difficult; nevertheless, both DLGM-D and DLGM-G show good transfer ability.

Results on SPair-71K.
The dataset (Min et al., 2019) is considered to contain more difficult matching instances and to have higher annotation quality. Results are summarized in Tab. 4. Our method consistently improves the matching performance, agreeing with the results on Pascal VOC and Willow.

5. CONCLUSION

Graph matching involves two essential factors: the affinity model and the topology. By incorporating learning paradigms for affinity/features, matching performance on public datasets has been significantly improved. However, there has been little previous work exploring more effective topology for matching. In this paper, we argue that learning a more effective graph topology can significantly improve the matching, and is thus essential. To this end, we propose to incorporate a latent topology module into an end-to-end deep network framework that learns to produce better graph topology. We also present the interpretation and optimization of the topology module from both deterministic and generative perspectives. Experimental results show that, by learning the latent graph, matching performance can be consistently and significantly enhanced on several public datasets.

A APPENDIX

A.1 HOLISTIC PIPELINE

We show the holistic pipeline of our framework in Fig. 4, consisting of two "singleton pipelines" (see the introduction of Sec. 3 for details). In general, the holistic pipeline follows the convention of a series of deep graph matching methods by utilizing an identical singleton pipeline to extract features, and then exploits the produced features to perform matching (Yu et al., 2020; Wang et al., 2019a; Fey et al., 2020; Rolínek et al., 2020). Except for the topology module N_G, all other parts of our network are the same as those in Rolínek et al. (2020).

A.2 SPLINECNN

SplineCNN is a method to perform graph-based representation learning via convolution operators defined based on B-splines (Fey et al., 2018) . The initial input to SplineCNN is G = {X, E, A}, where X ∈ G n×d1 and A ∈ {0, 1} n×n indicate node features and topology, respectively (same as in Sec. 3.1). E ∈ [0, 1] n×n×d2 is so-called pseudo-coordinates and can be viewed as n 2 × d 2dimensional edge features for a fully connected graph (in case m = n 2 , see Sec. 3.1). Let normalized edge feature e(i, j) = E i,j,: ∈ [0, 1] d2 if a directed edge (i, j) exists (A i,j = 1), and 0 otherwise (A i,j = 0). Note topology A fully carries the information of N (i) which defines the neighborhood 


of node $i$. During learning, $X$ and $E$ are updated while topology $A$ is not; SplineCNN is therefore a geometric graph embedding method that does not adjust the latent graph topology. B-splines serve as the basic kernels in SplineCNN, where each basis function has support only on a specific real-valued interval (Piegl & Tiller, 2012). Let $\big((N^q_{1,i})_{1 \le i \le k_1}, \dots, (N^q_{d_2,i})_{1 \le i \le k_{d_2}}\big)$ be $d_2$ B-spline bases of degree $q$, with kernel size $\mathbf{k} = (k_1, \dots, k_{d_2})$. The continuous kernel function $g_l: [a_1, b_1] \times \dots \times [a_{d_2}, b_{d_2}] \to \mathbb{R}$ is defined as
$$g_l(\mathbf{e}) = \sum_{\mathbf{p} \in \mathcal{P}} w_{\mathbf{p},l} \cdot B_{\mathbf{p}}(\mathbf{e}),$$
where $\mathcal{P} = (N^q_{1,i})_i \times \dots \times (N^q_{d_2,i})_i$ is the product of the B-spline bases (Piegl & Tiller, 2012) and $w_{\mathbf{p},l}$ is the trainable parameter corresponding to the $l$th node feature in $X$, with $B_{\mathbf{p}}$ being the product of the basis functions in $\mathbf{p}$:
$$B_{\mathbf{p}}(\mathbf{e}) = \prod_{i=1}^{d_2} N^q_{i,p_i}(e_i),$$
where $\mathbf{e}$ is the pseudo-coordinate in $E$. Then, given the kernel functions $g = (g_1, \dots, g_{d_1})$ and the node features $X \in \mathbb{R}^{n \times d_1}$, one layer of convolution at node $i$ in SplineCNN reads (same as Eq. (3)):
$$(x \ast g)(i) = \frac{1}{|\mathcal{N}(i)|} \sum_{l=1}^{d_1} \sum_{j \in \mathcal{N}(i)} x_l(j) \cdot g_l(e(i,j)),$$
where $x_l(j)$ denotes the node feature value of node $j$ at the $l$th dimension. This formulation can be tensorized into Eq. (4) with an explicit topology matrix $A$; in this sense, we can back-propagate the gradient through $A$. Readers are referred to Fey et al. (2018) for a more comprehensive treatment of this method.
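As a concrete illustration, the convolution above can be sketched in plain NumPy. This is a simplified sketch, not the official SplineCNN implementation: it uses degree-1 (hat) B-spline bases on $[0,1]$ and produces a single output channel per node, directly following the averaged double sum in the equation above; the names `hat_basis` and `spline_conv` are our own.

```python
import numpy as np

def hat_basis(e, k):
    """Degree-1 (hat) B-spline basis: k overlapping hat functions on [0, 1].
    They form a partition of unity, i.e. hat_basis(e, k).sum() == 1."""
    centers = np.linspace(0.0, 1.0, k)
    width = 1.0 / (k - 1)
    return np.maximum(0.0, 1.0 - np.abs(e - centers) / width)

def spline_conv(X, A, E, W, k):
    """One single-output-channel spline convolution.
    X: (n, d1) node features; A: (n, n) binary topology;
    E: (n, n, d2) pseudo-coordinates in [0, 1];
    W: (k**d2, d1) trainable weights w_{p,l}. Returns (n,) outputs."""
    n, d1 = X.shape
    d2 = E.shape[2]
    out = np.zeros(n)
    for i in range(n):
        nbrs = np.nonzero(A[i])[0]           # N(i) is read off the topology A
        for j in nbrs:
            # tensor-product basis B_p(e) = prod_i N_{i, p_i}(e_i)
            Bp = hat_basis(E[i, j, 0], k)
            for m in range(1, d2):
                Bp = np.outer(Bp, hat_basis(E[i, j, m], k)).ravel()
            g = Bp @ W                        # kernel values g_l(e(i, j)), shape (d1,)
            out[i] += X[j] @ g                # sum over feature dimensions l
        if len(nbrs) > 0:
            out[i] /= len(nbrs)               # 1 / |N(i)| normalization
    return out
```

Because the neighborhood $\mathcal{N}(i)$ is read directly off $A$, the same computation can be tensorized with $A$ as an explicit matrix factor, which is what makes the gradient with respect to $A$ accessible.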

A.3 DERIVATION OF DLGM-D

We give more details of the optimization of DLGM-D in this section. This part also interprets some basic formulation conversions (e.g. from Eq. (2) to its Bayesian form). First, assume there is no latent topology $A^{(s)}$ or $A^{(t)}$ at the current stage. In this case, the objective of GM is simply:
$$\max \prod_k P_\theta\big(Z_k \mid G^{(s)}_k, G^{(t)}_k\big), \tag{24}$$
where $P_\theta$ measures the probability of a matching $Z_k$ given the graph pair $G_k$. If we impose the latent topology $A^{(s)}$ and $A^{(t)}$, as well as some distribution over them, then Eq. (24) can be equivalently expressed as:
$$\max \prod_k P_\theta\big(Z_k \mid G^{(s)}_k, G^{(t)}_k\big) = \max \prod_k \sum_{A^{(s)}_k, A^{(t)}_k} P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big), \tag{25}$$
where $P_\theta\big(Z_k \mid G^{(s)}_k, G^{(t)}_k\big)$ is the marginal distribution of $P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)$ with respect to $Z_k$, since $A^{(s)}_k$ and $A^{(t)}_k$ are summed out over some distribution. Herein we can impose another distribution over the topology, $Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)$, characterized by parameter $\phi$; then we have:
$$\begin{aligned}
\log \sum_{A^{(s)}_k, A^{(t)}_k} P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)
&= \log \sum_{A^{(s)}_k, A^{(t)}_k} P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big) \frac{Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)}{Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)} \\
&= \log \mathbb{E}_{Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)} \left[ \frac{P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)}{Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)} \right] \\
&\ge \mathbb{E}_{Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)} \Big[ \log P_\theta\big(Z, A^{(s)}, A^{(t)} \mid G^{(s)}, G^{(t)}\big) - \log Q_\phi\big(A^{(s)}, A^{(t)} \mid G^{(s)}, G^{(t)}\big) \Big],
\end{aligned} \tag{26}$$
where the final step follows from Jensen's inequality. Since optimizing Eq. (25) directly is difficult, we instead maximize the right-hand side of Eq. (26), which is the Evidence Lower Bound (ELBO) (Bishop, 2006). Since the two input graphs are handled separately by two identical subroutines (see Fig. 2a), we can impose independence of the topologies $A^{(s)}_k$ and $A^{(t)}_k$:
$$Q_\phi\big(A^{(s)}, A^{(t)} \mid G^{(s)}, G^{(t)}\big) = Q_\phi\big(A^{(s)} \mid G^{(s)}\big)\, Q_\phi\big(A^{(t)} \mid G^{(t)}\big).$$
In this sense, we can utilize the same parameter $\phi$ to characterize two identical neural networks (generators) for modeling $Q_\phi$. Assuming $\theta$ is fixed, the ELBO is determined by $Q_\phi$. According to Jensen's inequality, equality in Eq. (26) holds when:
$$\frac{P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)}{Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)} = c,$$
where $c \neq 0$ is a constant. We then have:
$$\sum_{A^{(s)}_k, A^{(t)}_k} P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big) = c \sum_{A^{(s)}_k, A^{(t)}_k} Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big).$$
As $Q_\phi$ is a distribution, we have $\sum_{A^{(s)}_k, A^{(t)}_k} Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big) = 1$, and therefore:
$$\sum_{A^{(s)}_k, A^{(t)}_k} P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big) = c.$$
We now have:
$$Q_\phi\big(A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big) = \frac{P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)}{c} = \frac{P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)}{\sum_{A^{(s)}_k, A^{(t)}_k} P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)} = \frac{P_\theta\big(Z_k, A^{(s)}_k, A^{(t)}_k \mid G^{(s)}_k, G^{(t)}_k\big)}{P_\theta\big(Z_k \mid G^{(s)}_k, G^{(t)}_k\big)} = P_\theta\big(A^{(s)}_k, A^{(t)}_k \mid Z_k, G^{(s)}_k, G^{(t)}_k\big). \tag{31}$$
Eq. (31) shows that, once $\theta$ is fixed, maximizing the ELBO amounts to finding a distribution $Q_\phi$ approximating the posterior $P_\theta\big(A^{(s)}_k, A^{(t)}_k \mid Z_k, G^{(s)}_k, G^{(t)}_k\big)$. This can be done by training the generator $Q_\phi$ to produce latent topology $A$ given the graph pair and the matching $Z$. This corresponds to the Inference part in Sec. 3.4.2.
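To make the ELBO in Eq. (26) concrete, the following sketch estimates it by Monte Carlo, under the simplifying (and purely illustrative) assumption that $Q_\phi$ factorizes into independent Bernoulli distributions over candidate edges; `edge_logits` and `log_joint` are hypothetical names standing in for the generator's outputs and the model's log-joint, not the paper's actual API.

```python
import numpy as np

def elbo_estimate(edge_logits, log_joint, n_samples=16, seed=0):
    """Monte Carlo ELBO: E_Q[log P_theta(Z, A | G) - log Q_phi(A | G)].
    edge_logits: (m,) logits of Q_phi over m candidate edges (Bernoulli each);
    log_joint: callable A -> log P_theta(Z, A | G), supplied by the model."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-edge_logits))            # edge probabilities under Q_phi
    vals = []
    for _ in range(n_samples):
        A = (rng.random(p.shape) < p).astype(float)   # sample topology A ~ Q_phi
        log_q = np.sum(A * np.log(p) + (1.0 - A) * np.log(1.0 - p))
        vals.append(log_joint(A) - log_q)             # one ELBO sample
    return float(np.mean(vals))
```

When `log_joint` coincides with the log-probability of $A$ under $Q_\phi$ itself, every sample contributes the same constant and the bound is tight, mirroring the equality condition derived above.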

A.4 ABLATION STUDY

In this part, we evaluate the performance of DLGM-D and DLGM-G by selectively deactivating different loss functions (refer to Sec. 3.3 for more details of these functions). We also test DLGM-G using different sample sizes from the generator. The ablation is conducted on the Pascal VOC dataset and average accuracy is reported in Tab. 5. We first test both settings of DLGM by selectively activating the designated loss functions; the results are summarized in Tab. 5a. As the matching loss L_M is essential for the GM task, it remains activated in all settings. We see that the proposed novel losses L_C and L_L consistently enhance matching performance. Besides, DLGM-G indeed delivers better performance than DLGM-D under fair comparison. We then test the impact of the sample size from the generator Q_φ under DLGM-G; the results are summarized in Tab. 5b. Average accuracy rises with increasing sample size and becomes stable once the sample size exceeds 16.

A.5 MORE VISUAL EXAMPLES AND ANALYSIS

We show more visual examples of matchings and generated topology using DLGM-G on Pascal VOC in Tab. 6 and Tab. 7, respectively. Each table follows its own coloring scheme, detailed as follows:
• Tab. 6. For each class, the left and right images correspond to Delaunay triangulation. The image in the middle shows the predicted matching and the generated graph topology. Cyan solid and dashed lines correspond to correct and wrong matchings, respectively. Green dashed lines are ground-truth matchings missed by our model.
• Tab. 7. The leftmost and rightmost columns show the original topology constructed by Delaunay triangulation. The two middle columns show the topology generated by our method given the Delaunay triangulation as prior. Blue edges are shared by the Delaunay and generated graphs. Green edges appear in the Delaunay graph but not in the generated topology, while red edges are generated but absent from the Delaunay graph.
We give some analysis for the following questions. In what case is a different graph generated? Since some generated graphs are identical to the Delaunay ones, this question naturally arises. We observe that DLGM tends to produce a graph identical to Delaunay when objects exhibit little distortion and graphs are simple (e.g. tv, bottle and plant in Tab. 6 and the last two rows in Tab. 7). However, when Delaunay triangulation is insufficient to reveal complex geometric relations, or objects come with large distortion and feature diversity (e.g. cow and cat in Tab. 6 and person in Tab. 7), DLGM resorts to generating new topology that carries richer and stronger hints for graph matching. In other words, DLGM finds a way to identify whether the current instance pair is difficult or easy to match, and learns an adaptive strategy to handle the two cases. Why does DLGM-G deliver better performance than DLGM-D? In general, DLGM-D is a deterministic gradient-based method.
That is, the solution trajectory of DLGM-D almost follows the gradient direction at each iteration (with some variance from minibatching). Though it is assured to reach a local optimum, following only the gradient is too greedy, since the generated graph is coupled with the predicted matching. Besides, as the topology is discrete, the optimal continuous solution can have a large objective-score gap to its nearest discrete sampled solution once the landscape of the neural network is too sharp. On the other hand, DLGM-G performs discrete sampling under a feasible graph distribution at each iteration, which generally but not strictly follows the gradient direction. This procedure can find better discrete directions with some probability, hence better exploring the search space. This behavior is similar to reinforcement learning, but with much higher efficiency. Additionally, the EM framework guarantees convergence (Bishop, 2006).
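The contrast between the two regimes can be sketched as follows, with `logits` standing in for a generator's hypothetical per-edge scores (both helpers are illustrative, not the paper's implementation): deterministic decoding commits to a single thresholded graph, while stochastic decoding samples discrete graphs around the current distribution and so can step off a sharp continuous optimum.

```python
import numpy as np

def threshold_topology(logits, thresh=0.0):
    """Deterministic decoding (DLGM-D flavored sketch): one fixed graph,
    obtained by thresholding the continuous edge scores."""
    return (logits > thresh).astype(int)

def sample_topology(logits, rng):
    """Stochastic decoding (DLGM-G flavored sketch): each edge drawn from
    the Bernoulli distribution implied by its logit, exploring nearby graphs."""
    p = 1.0 / (1.0 + np.exp(-logits))   # sigmoid: logit -> edge probability
    return (rng.random(logits.shape) < p).astype(int)
```

With confident (large-magnitude) logits the two decoders agree; with uncertain logits the sampler keeps producing distinct discrete candidates, which is the exploration behavior discussed above.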



Footnotes:
1. Without loss of generality, we discuss graph matching under the setting of an equal number of nodes without outliers. The unequal case can be readily handled by introducing extra constraints or dummy nodes. Bipartite matching and graph isomorphism are subsets of this quadratic formulation (Loiola et al., 2007).
2. There are some loosely related works (Du et al., 2019; 2020) on network alignment and link prediction without learning, which are discussed in detail in the related works.
3. We consider the case when only node features X and topology A are necessary. Edge features E can be readily integrated as another input.
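The dummy-node trick mentioned above can be sketched for a node-to-node similarity matrix: pad the smaller side with zero-similarity dummy nodes so a square assignment exists, solve, then discard the dummy assignments. `match_with_dummies` is our illustrative helper (brute force, tiny graphs only), not the paper's solver.

```python
import itertools
import numpy as np

def match_with_dummies(S):
    """Assignment on an n1 x n2 similarity matrix S with n1 <= n2.
    Pads with zero-similarity dummy source nodes to make the problem square,
    solves it by brute force, and drops the dummy rows from the result."""
    n1, n2 = S.shape
    P = np.zeros((n2, n2))
    P[:n1, :] = S                              # rows n1..n2-1 are dummy nodes
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n2)):
        score = sum(P[i, perm[i]] for i in range(n2))
        if score > best_score:
            best_score, best_perm = score, perm
    return {i: best_perm[i] for i in range(n1)}  # keep real-node assignments only
```

Because the dummy rows contribute zero similarity, they absorb the unmatched target nodes without biasing the score of the real assignments.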



s.t. $Z \in \{0,1\}^{n \times n}$, $H\mathbf{z} = \mathbf{1}$.  (1)

Figure 1: Matching by BBGM (Rolínek et al., 2020), 11/13 correct, with Delaunay triangulation, versus our DLGM-G, 13/13 correct, using the generated graph (Pascal VOC). DLGM-G generates graphs with 4 more edges than Delaunay (33 vs. 29) for both source and target. But with 4 more edges shared across source and target than Delaunay triangulation (26 vs. 22), it leads to better accuracy. Blue and red edges denote common edges in the Delaunay and learned graph pairs, respectively.

Singleton pipeline of DLGM.

Example of consistency loss.

Figure 2: (a) One of the two branches of our DLGM framework (see the complete version in Appendix A.1). $N_B$: VGG16 backbone producing a global feature of the input image and the initial $X$ and $E$; $N_G$: deterministic or generative module producing the latent topology $A$; $N_R$: SplineCNN for feature refinement producing the updated $X$ and $E$. (b) A schematic figure showing the merit of introducing the consistency loss $L_C$ for training. Initial topologies $A^{(s)}$ and $A^{(t)}$ are constructed by Delaunay triangulation. Given the matching $Z$ as guidance, latent topologies $\hat{A}^{(s)}$ and $\hat{A}^{(t)}$ are generated from the inputs $A^{(s)}$ and $A^{(t)}$, respectively. Note the learned topologies $\hat{A}^{(s)}$ and $\hat{A}^{(t)}$ are isomorphic ($L_C = 0$) w.r.t. $Z$, which is easier to match at test time, compared to the non-isomorphic input structures ($L_C = 4$).

Figure 3: The consistency and locality losses (Eq. (9) and (10)) keep decreasing over training, showing the effectiveness of adaptive topology learning for matching.

Figure 4: Holistic pipeline of DLGM consisting of two singleton pipelines.

Table 5: Ablation test on the Pascal VOC dataset. (a) Selectively deactivating loss functions on Pascal VOC: L_M, L_C and L_L are selectively activated in DLGM-D and DLGM-G; "full" indicates all loss functions are activated. Average accuracy (%) is reported. (b) Average matching accuracy under different sampling sizes from the generator Q_φ with the "full" DLGM-G setting.

(a) On losses:
method              | Ave
DLGM-D (L_M + L_C)  | 79.8
DLGM-D (L_M + L_L)  | 79.5
DLGM-G (L_M + L_C)  | 80.9
DLGM-G (L_M + L_L)  | 80.

Table 6: Matching examples of DLGM-G on 20 classes of Pascal VOC. The coloring of graphs and matchings follows the principle of Fig. 1 in the manuscript. Zoom in for better view.

Recent deep graph matching methods have shown how to extract more dedicated feature representations. The work of Zanfir & Sminchisescu (2018) adopts VGG16 (Simonyan & Zisserman, 2014) as the backbone for feature extraction on images. Other efforts have been made in developing more advanced pipelines, where graph embedding

Accuracy (%) on Pascal VOC (best in bold). Only inlier keypoints are considered.

F1-score (%) on Pascal VOC. Experiments are performed on pairs of images where both inlier and outlier keypoints are considered. BBGM-max is a setting in Rolínek et al. (2020).
method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Ave

Accuracy (%) on SPair-71K compared with state-of-the-art methods (best in bold). method aero bike bird boat bottle bus car cat chair cow dog horse mbike person plant sheep train tv Ave

Accuracy (%) on Willow Object.

SPair-71K is much larger than Pascal VOC and Willow Object since it consists of 70,958 image pairs collected from Pascal VOC 2012 and Pascal 3D+ (53,340 for training, 5,384 for validation and 12,234 for testing). It improves over Pascal VOC by removing the ambiguous categories sofa and dining table.

Tianshu Yu, Junchi Yan, Yilin Wang, Wei Liu, et al. Generalizing graph matching beyond quadratic assignment model. In NIPS, 2018.
Tianshu Yu, Runzhong Wang, Junchi Yan, and Baoxin Li. Learning deep graph matching with channel-independent embedding and hungarian attention. In ICLR, 2020.
A. Zanfir and C. Sminchisescu. Deep learning of graph matching. In CVPR, 2018.
Si Zhang and Hanghang Tong. Final: Fast attributed network alignment. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1345-1354. ACM, 2016.
Zhen Zhang and Wee Sun Lee. Deep graphical feature learning for the feature matching problem. In ICCV, 2019.
F. Zhou and F. Torre. Factorized graph matching. IEEE PAMI, 2016.

Delaunay 1 | Generated 1 | Generated 2 | Delaunay 2

Table 7: Generated topology compared with the original Delaunay triangulation in a pairwise fashion. Note the 1st and 4th columns correspond to the two input images with topology constructed by Delaunay triangulation, respectively; the 2nd and 3rd columns are the generated topology given the Delaunay results as prior.

