REVOCABLE DEEP REINFORCEMENT LEARNING WITH AFFINITY REGULARIZATION FOR OUTLIER-ROBUST GRAPH MATCHING

Abstract

Graph matching (GM) has been a building block in various areas including computer vision and pattern recognition. Despite recent impressive progress, existing deep GM methods often struggle to handle outliers, which are ubiquitous in practice. We propose RGM, a deep reinforcement learning based approach whose sequential node matching scheme naturally fits the strategy of selectively matching inliers against outliers. A revocable action framework is devised to improve the agent's flexibility on the heavily constrained GM problem. Moreover, we propose a quadratic approximation technique to regularize the affinity score in the presence of outliers. As such, the agent can stop matching inliers in time once the affinity score stops growing; otherwise an additional parameter, i.e. the number of inliers, would be needed to avoid matching outliers. In this paper, we focus on learning the back-end solver under the most general form of GM, Lawler's QAP, whose input is the affinity matrix. In particular, our approach can also boost existing GM methods that operate on such input. Experiments on multiple real-world datasets demonstrate its competitive accuracy and robustness.

1. INTRODUCTION

Graph matching (GM) aims to find node correspondences between two or multiple graphs. As a long-standing and fundamental problem, GM spans wide applications in different areas including computer vision and pattern recognition. With increasing computing resources, graph matching that involves second-order edge affinity (in contrast to the linear assignment problem, e.g. bipartite matching) has become a powerful and relatively affordable tool for solving correspondence problems of moderate size, and there is growing research in this area, especially with the introduction of deep learning in recent years (Zanfir et al., 2018; Wang et al., 2019b). GM can be formulated as a combinatorial optimization problem, namely Lawler's Quadratic Assignment Problem (Lawler's QAP) (Lawler, 1963), which is NP-hard. Generally speaking, handling the graph matching problem involves two steps: extracting features from input images to formulate a QAP instance, and solving that QAP instance via constrained optimization, namely the front-end feature extractor and the back-end solver, respectively. Impressive progress has been made for graph matching with the introduction of rich deep learning techniques. However, in existing deep GM works, the deep learning modules are mainly applied to the front-end, especially for visual images, using CNNs for node feature learning (Zanfir et al., 2018) and GNNs for structure embedding (Li et al., 2019). Compared with learning-free methods, learnable features have shown greater effectiveness.
Another advantage of using neural networks is that the graph structure information can be readily embedded into unary node features, such that the classic NP-hard QAP in fact degenerates into the linear assignment problem, which can be readily solved by existing back-end solvers.

Towards practical and robust graph matching learning, in the absence of labels and in the presence of outliers (in both input graphs), we propose a reinforcement learning (RL) method for graph matching, namely RGM, especially for its most general QAP formulation. In particular, RL is conceptually well suited due to its label-free nature and its flexibility in finding the node correspondence by sequential decision making, which provides a direct way of avoiding outlier over-matching through early stopping. In contrast, in existing deep GM works, matching is performed in one shot, which couples the inliers and outliers and lacks an explicit way to distinguish outliers. Therefore, we devise a so-called revocable deep reinforcement learning framework that allows small mistakes over the matching procedure: the current action is revocable, so the agent can search again for a better node correspondence based on up-to-date environment information. Our revocable framework is shown to be cost-effective and empirically outperforms existing popular techniques for refining local decision making, e.g. Local Rewrite (Chen & Tian, 2019). Moreover, since the standard GM objective refers to maximizing the affinity score between matched nodes, it causes the over-matching issue, i.e. outliers are also incorporated in the matching to increase the overall score. To address this issue, we propose to regularize the affinity score such that it discourages unwanted matchings by assigning a negative score to those pairs. Intuitively, the RL agent will naturally stop matching spurious outliers as the objective score would otherwise decrease.
With the help of the revocable framework and affinity regularization, our RGM shows promising performance in various experiments. For clarity, we compare our RGM with most existing GM methods in Table 5; due to the space limit, we place it in the appendix (A.1), where we also give a more detailed discussion comparing RGM with existing works. To sum up, the highlights and contributions of our work are: 1) We propose RGM, which sequentially selects node correspondences from two graphs, in contrast to the majority of existing works that obtain the whole matching in one shot. Accordingly, our approach can naturally handle partial matching (due to outliers) by early stopping. 2) Specifically, we devise a revocable approach to select possible node correspondences, whose mechanism is adapted to unlabeled graph data with the affinity score as the reward. To the best of our knowledge, this is the first attempt to successfully adapt RL to graph matching. 3) To avoid matching outliers, we develop a regularization of the affinity score so that the solver no longer pursues matching as many nodes as possible. To our best knowledge, this is also the first work on regularizing the affinity score to avoid over-matching of outliers. 4) On synthetic datasets, the Willow Object dataset, the Pascal VOC dataset, and the QAPLIB benchmark, RGM shows competitive performance compared with both learning-free and learning-based baselines. Note that RGM focuses on learning the back-end solver and hence is orthogonal to many existing front-end feature learning based GM methods, whose performance it can further boost as shown in our experiments.

2. RELATED WORKS

Graph matching. Graph matching aims to find node correspondences by considering both node features and edge attributes, and is NP-hard in its general form (Loiola et al., 2007). Classic methods mainly resort to different optimization heuristics, e.g. random walk (Cho et al., 2010), spectral matching (Leordeanu & Hebert, 2005), path-following (Zhou & Torre, 2016), graduated assignment (Gold et al., 1996), and SDP relaxation (Schellewald & Schnörr, 2005).

Deep learning of GM. Since the seminal work (Zanfir et al., 2018), deep neural networks have been devised for graph matching (Yan et al., 2020). Among the deep GM methods, we discuss two representative lines of research as follows. The first line of works (Wang et al., 2019b; Yu et al., 2020) applies CNNs or/and GNNs for learning the node and structure features. By using node embeddings, the problem degrades into linear assignment, which can be optimally solved by the non-learnable Sinkhorn network (Cuturi, 2013) to fulfill double-stochasticity. Instead, another line of studies (Wang et al., 2020c) follows the general Lawler's QAP form exactly, and the problem becomes combinatorial node selection on the association graph. Yet all the above models adopt supervised learning, and there is little reported success in using RL to solve GM despite its label-free advantage and popularity in solving other combinatorial problems.

Matching against outliers. Matching against outliers has been a long-standing task, especially in vision, with seminal works dating back to RANSAC (Fischler & Bolles, 1981). There are efforts (Torresani et al., 2012; Yang et al., 2015; Yi et al., 2018; Zhang et al., 2019; Liu et al., 2020) exploring specific problem structures and clues in terms of spatial and geometric coherence to achieve robust matching.
As we focus on the general setting of GM without additional assumptions or parametric transform models, the relevant pairwise graph matching works are relatively few. Wang et al. (2019a) utilize a strategy based on domain adaptation, which removes outliers via a data normalization module. A heuristic max-pooling strategy is developed in Cho et al. (2014) for dismissing excessive outliers. Wang et al. (2020a) propose to suppress the matching of outliers by assigning zero-valued vectors to the potential outliers. However, all the above methods are learning-free; to our best knowledge, the trending learning-based solvers have not addressed the outlier problem explicitly, except by adding dummy nodes (Wang et al., 2021a).

Reinforcement learning for combinatorial optimization.

There is growing interest in using RL for solving combinatorial optimization problems (Bengio et al., 2021), such as the traveling salesman problem (Khalil et al., 2017), the vehicle routing problem (Nazari et al., 2018), the job scheduling problem (Chen & Tian, 2019), and maximal common subgraph (Bai et al., 2021). The main challenges of these approaches are designing a suitable problem representation and tuning the reinforcement learning algorithm. For combinatorial optimization problems on a single graph, pointer networks (Vinyals et al., 2015) and GNNs (Kipf & Welling, 2017) are the most widely used representations. However, graph matching takes two graphs as input, and the agent needs to pick a node from each of the two graphs at every step. To our best knowledge, GM has not been (successfully) addressed by RL.

3. PRELIMINARIES

In this paper, we mainly focus on matching two graphs, i.e. pairwise graph matching. Specifically, we consider the more difficult situation where there are outliers in both graphs, and we want to match the similar inliers while ignoring the outliers. Given two weighted graphs G_1 and G_2, we aim to find the matching between their nodes such that the affinity score is maximized. We use V_1 and V_2 to represent the nodes of G_1 and G_2, with |V_1| = n_1 and |V_2| = n_2, and there can be outliers in G_1, G_2, or both. E_1 and E_2 denote the edge attributes of G_1 and G_2. The affinities of pairwise graph matching include the first-order (node) affinities and the second-order (edge) affinities. Generally, the graph matching problem can be regarded as Lawler's Quadratic Assignment Problem (Lawler, 1963):

J(X) = vec(X)^T K vec(X),  s.t.  X ∈ {0, 1}^{n_1 × n_2},  X 1_{n_2} ≤ 1_{n_1},  X^T 1_{n_1} ≤ 1_{n_2}   (1)

where X is the (partial) permutation matrix whose elements are 0 or 1, and X(i, a) = 1 denotes that node i in graph G_1 is matched with node a in graph G_2. The operator vec(·) means column-vectorization. K ∈ R^{n_1 n_2 × n_1 n_2} is the affinity matrix. For node i in G_1 and node a in G_2, the node-to-node affinity is encoded by the diagonal element K(ia, ia), while for edge ij in G_1 and edge ab in G_2, the edge-to-edge affinity is encoded by the off-diagonal element K(ia, jb). Assuming i and a both start from 0, the index ia means i × n_2 + a. The objective is to maximize the sum of the first-order and second-order affinity scores J(X) given the affinity matrix K by finding an optimal permutation X.

Figure 1: Left: the association graph (G_a on the top) can be derived from the raw input graphs (G_1, G_2 at the bottom). Right: the matching process of our RL procedure: the blue vertices denote available vertices, the green vertices denote selected (matched) vertices, and the blurred vertices denote unavailable vertices. The agent selects "1a", "2b", and "3c" progressively on G_a.

Graph matching involves (at least) two input graphs. Instead of directly working on the two individual graphs, which can disclose raw data information and might be privacy-sensitive, we first construct the association graph of the pairwise graphs as the input representation (Leordeanu & Hebert, 2005; Wang et al., 2021a). Specifically, we construct an association graph G_a = (V_a, E_a) from the original pairwise graphs G_1 and G_2 with the help of the affinity matrix K. We merge each node pair (v_i, v_a) ∈ V_1 × V_2 into a vertex v_p ∈ V_a, so the association graph contains |V_a| = n_1 × n_2 vertices. There exists an edge between every two vertices as long as they do not contain the same node from the original pairwise graphs, so every vertex is connected to (n_1 - 1) × (n_2 - 1) edges. The association graph carries both vertex weights w(v_p) and edge weights w(v_p, v_q), which denote the first-order and second-order affinities:

F(p, p) = w(v_p) = K(ia, ia), where p = ia
W(p, q) = w(v_p, v_q) = K(ia, jb), where p = ia, q = jb   (2)

where the vertex index p in the association graph G_a corresponds to the combination of indices i and a in the original pairwise graphs G_1 and G_2. F, W ∈ R^{n_1 n_2 × n_1 n_2} are the weight matrices that contain the vertex weights and edge weights of the association graph, respectively. Fig. 1 shows an example of constructing the association graph from the input graphs.
In the association graph G_a, selecting a vertex p equals matching nodes i and a in the input graphs. Therefore, we can select a set of vertices U, meaning we match the corresponding node pairs as the solution. Note that the set of vertices U in the association graph is equivalent to the permutation matrix X and can be easily converted, as long as the set U does not violate the constraints in Eq. 1.
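To make the construction concrete, below is a minimal NumPy sketch (the function name and array layout are ours, not from the paper) that derives the vertex-weight matrix F, edge-weight matrix W, and adjacency A of the association graph from an affinity matrix K, following Eq. 2 and the vertex indexing p = i × n_2 + a:

```python
import numpy as np

def build_association_graph(K, n1, n2):
    """Derive the association graph from the affinity matrix K.

    Vertex p = i*n2 + a stands for matching node i in G1 to node a in G2.
    Returns vertex weights F (diagonal of K), edge weights W (off-diagonal
    part of K), and the adjacency matrix A: two vertices are connected iff
    they share no node of the original graphs.
    """
    n = n1 * n2
    assert K.shape == (n, n)
    F = np.diag(np.diag(K))        # first-order (node) affinities
    W = K - F                      # second-order (edge) affinities
    i_idx = np.arange(n) // n2     # node of G1 contained in each vertex
    a_idx = np.arange(n) % n2      # node of G2 contained in each vertex
    A = ((i_idx[:, None] != i_idx[None, :]) &
         (a_idx[:, None] != a_idx[None, :])).astype(int)
    return F, W, A
```

For two 2-node graphs (n1 = n2 = 2), each of the four vertices is adjacent to exactly (n1 - 1) × (n2 - 1) = 1 other vertex, matching the degree count stated above.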

4.1. REINFORCEMENT LEARNING FOR GRAPH MATCHING

We design our RGM based on Double Dueling DQN (D3QN) (Hasselt et al., 2016), as it is widely accepted (Sutton & Barto, 2018) that value-based RL algorithms like D3QN can better handle the discrete case, which is also verified in our ablation study. The pipeline of the training process of RGM is described in Alg. 1. Details of the proposed method are given in Appendix A.2, including the network structure (A.2.1), prioritized experience replay memory (A.2.2) and model updating algorithm (A.2.3). We now describe the design of state, action, and reward in RGM:

1) State. The state s is the current (partial) solution U, where U is a set of vertices in the association graph with |U| ≤ min(n_1, n_2). The size of U increases from 0 at the beginning, and the partial solution U finally becomes a complete solution when the agent decides to stop the episode.

2) Action. The action a of our reinforcement learning agent is to select a vertex in the association graph and add it to the current solution U. By the definition of graph matching, we cannot match two nodes in G_1 to the same node in G_2 and vice versa. Therefore, in our basic RL framework, the agent can only select vertices from the available vertex set. Take Fig. 1 for an example: once we select the vertex "1a", we have matched node "1" in G_1 and node "a" in G_2. Then we cannot match node "1" to node "b" or "c" later, which means we cannot select vertices "1b" or "1c". Given the partial solution U, the available vertex set V is:

V = {v | A(v, v') = 1, ∀ v' ∈ U, v ∈ V_a}   (3)

where V_a is the vertex set of the association graph and A is its adjacency matrix (0 for no edge and 1 for an existing edge). Eq. 3 holds since two vertices are connected iff they do not contain the same node from the input graphs: if a vertex is connected to all vertices in U, then it has no conflict.
Given the available set, we mask all unavailable vertices in the association graph to make sure the agent cannot select them, as illustrated by the blurred vertices in Fig. 1. The action is then to pick a vertex v from the available set V: U_old → U_new with v ∈ V, where U_old is the old partial solution and U_new is the new partial solution after the action. It is worth noting that the requirement to select the vertex from the available set only exists in our basic RL framework; in the revocable RL framework proposed later, the agent can select any vertex without this constraint.

3) Reward. We define the reward r as the improvement of the objective score between the old partial solution and the new partial solution after executing an action.
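The availability test of Eq. 3 can be sketched as a simple mask over the association graph's adjacency matrix (a NumPy illustration under our own naming, not the paper's implementation):

```python
import numpy as np

def available_vertices(A, U):
    """Available vertex set of Eq. 3: vertices of the association graph
    that are adjacent (A[v, v'] = 1) to every vertex v' of the current
    partial solution U, i.e. that conflict with no selected match."""
    mask = np.ones(A.shape[0], dtype=bool)
    for v in U:
        mask &= (A[v] == 1)    # must be connected to each selected vertex
    mask[list(U)] = False      # selected vertices themselves are unavailable
    return set(np.flatnonzero(mask))

# Association graph of two 2-node graphs: vertex p = i*n2 + a, and two
# vertices are adjacent iff they share neither i nor a.
A = np.array([[0, 0, 0, 1],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]])
print(available_vertices(A, set()))   # all four vertices available
print(available_vertices(A, {0}))     # only the vertex sharing no node remains
```

In the basic framework, this mask is applied to the Q-values so the agent can only pick conflict-free vertices.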

4.2. THE REVOCABLE ACTION MECHANISM

So far we have presented a basic RL framework. In such a vanilla form, it cannot undo actions that have been executed. In other words, the agent cannot "regret": it has no chance to adjust a wrong decision, and errors may accumulate until a disastrous solution is obtained. To strike a reasonable trade-off between efficiency and efficacy, we develop a mechanism to allow the agent to re-select a vertex on the association graph in one revocable step. To allow the agent to modify decisions made before, we design a new revocable RL framework. We remove the available set used before, and the agent is free to choose any vertex, even if it is in conflict. We then modify the strategy in our RL environment: if the environment receives a new vertex from the agent's action that conflicts with currently selected vertices, it removes the one or two vertices in conflict with the incoming vertex, and then adds the new vertex to the current solution. Our proposed revocable RL framework is illustrated in Fig. 2. As Fig. 2 shows, the pipelines of our revocable RGM differ slightly from the basic framework; the available set of the basic RGM does not exist when the revocable mode is on. Pipeline (A) shows a simple situation in which the agent matches the vertices directly without any revoking operation. In pipeline (B), we suppose the agent chooses the vertex "2c" for its second action, which is not a good choice. When choosing the third action, the agent realizes that it made a mistake by selecting "2c"; therefore, the agent chooses "2b" to fix this mistake. This action is passed to the environment, which reverts "2c" and selects "2b" instead. In other words, the agent can reverse "2c" to "2b". In pipeline (C), we show another revocable situation. Suppose the agent selects "2c" and "3b" as its second and third actions, and acquires a complete matching solution.
However, it turns out the matching solution ("1a", "2c", and "3b") is not as good as expected. To roll this situation back, the agent can select "2b" as the next action. By selecting "2b", the environment will release the vertices "2c" and "3b", and then select "2b". Finally, the agent chooses "3c" as the last action and decides to end the episode with the matching solution ("1a", "2b", and "3c").

Figure 3: (a) the original affinity matrix K; (b)-(d) the regularized affinity matrices K^1_reg, K^2_reg, K^3_reg obtained with the three regularization functions.

Above all, our proposed revocable reinforcement learning framework allows the agent to make better decisions by giving it the opportunity to turn back. From now on, the default setting of RGM includes the revocable action framework. For further discussion, especially the differences from other similar frameworks (notably the popular Local Rewrite (Chen & Tian, 2019) and ECO-DQN (Barrett et al., 2020)), please refer to Appendix A.2.4.
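The environment's conflict-revoking rule can be sketched in a few lines (a hypothetical helper under our own naming; vertices are indexed p = i × n_2 + a as in Sec. 3):

```python
def revocable_step(solution, new_vertex, n2):
    """One revocable action (sketch): add vertex p = i*n2 + a to the
    current solution, first revoking any previously selected vertex that
    reuses node i of G1 or node a of G2 (at most two such vertices)."""
    i, a = divmod(new_vertex, n2)
    conflicts = {v for v in solution if v // n2 == i or v % n2 == a}
    return (solution - conflicts) | {new_vertex}

# Pipeline (B) of Fig. 2 with 3x3 graphs ("1a"=0, "2b"=4, "2c"=5, "3c"=8):
s = revocable_step(set(), 0, 3)   # select "1a"
s = revocable_step(s, 5, 3)       # select "2c" (a mistake)
s = revocable_step(s, 4, 3)       # "2b" revokes the conflicting "2c"
s = revocable_step(s, 8, 3)       # select "3c" -> {"1a", "2b", "3c"}
```

Pipeline (C) follows the same rule: selecting "2b" after the complete solution {"1a", "2c", "3b"} revokes both "2c" (same G_1 node) and "3b" (same G_2 node) in one step.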

4.3. OUTLIER-ROBUST GRAPH MATCHING

For practical GM, outliers are common in both input graphs. In general, the solution is supposed to contain only the inlier correspondences, without outliers. However, most existing GM methods are designed to match all keypoints, whether they are inliers or outliers: this design stems from pursuing the highest objective score, and matching outliers can also increase the objective score to some extent. We believe the outliers should not be matched in any sense. Therefore, we propose two strategies to guide the agent to match only the inliers and ignore the outliers.

4.3.1. INLIER COUNT INFORMATION EXPLORATION

Our first strategy is to inform the agent of the number of common inliers. Similar input settings can also be found in learning-free outlier-robust GM methods (Yang et al., 2017; Wang et al., 2020a). Given the exact number of inliers n_i, the sequential matching of RGM can be readily stopped when the size of the current solution reaches |U| = n_i, leaving the remaining n_1 - n_i and n_2 - n_i nodes as outliers.
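As a minimal sketch (the helper name is ours), the inlier-count stopping rule amounts to terminating the episode once the partial solution holds n_i matches:

```python
def should_stop(solution, n_i):
    """Terminate the matching episode once the partial solution contains
    n_i vertices; all remaining unmatched nodes are treated as outliers."""
    return len(solution) >= n_i

# With 10 common inliers, the episode ends after the 10th selection,
# however many outlier nodes each graph contains.
partial = set(range(9))
print(should_stop(partial, 10))            # still matching
print(should_stop(partial | {9}, 10))      # stop: 10 matches reached
```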

4.3.2. AFFINITY REGULARIZATION VIA QUADRATIC APPROXIMATION

Our second strategy is to regularize the affinity score (i.e. the objective score). As mentioned before, existing GM methods tend to match all keypoints, including outliers, to make the affinity score as high as possible. Therefore, one straightforward idea is to regularize the affinity score by penalizing over-matching terms to balance the effect of the outliers. In this paper, we propose to regularize the original affinity score, which blindly adds up all the node/edge correspondence affinity values including those of outliers. In other words, we aim to design an inlier-aware regularized affinity score, denoted by J_reg(X), that dismisses the effect of outliers in affinity score computing, as the affinity score from inlier matching is more meaningful. Specifically, the regularization is devised as a function of the number of matched keypoints in X, denoted by f(||vec(X)||_1), which is multiplied with the original affinity score:

J_reg(X) = vec(X)^T K vec(X) · f(||vec(X)||_1)   (4)

In general, f(||vec(X)||_1) shall become smaller when there are more matched keypoints, so as to suppress spurious outlier matchings. Without loss of generality, in this paper we choose the following three functions:

f_1(n) = (3·max(n_1, n_2) - n) / (3·max(n_1, n_2)),  f_2(n) = (1 + n) / (1 + 3n),  f_3(n) = 1/n^2.

However, the above formula is unfriendly for learning in RGM, as our core GNN can only accept the affinity matrix K, namely the association graph, as input, while the impact of f(||vec(X)||_1) cannot be delivered to the GNN. To fill this gap, we further propose a technique to transform Eq. 4 into a standard QAP formulation, and construct the regularized affinity matrix K_reg as the input for the GNN instead of the original affinity matrix K. Recall the original score J in Eq. 1 and further denote n_x = ||vec(X)||_1. Eq. 4 can be rewritten as:

J_reg(X) = vec(X)^T K vec(X) - J · (1 - f(n_x)) ≈ vec(X)^T K vec(X) - J · g(n_x)   (5)

Figure 4: The empirical relation between F1 score and original/regularized affinity scores, for (a) f_1(n) = (3·max(n_1, n_2) - n)/(3·max(n_1, n_2)), (b) f_2(n) = (1 + n)/(1 + 3n), (c) f_3(n) = 1/n^2. The affinity score under the original affinity matrix is calculated as (vec(X_pred)^T K vec(X_pred)) / (vec(X_gt)^T K vec(X_gt)), and under the regularized affinity matrix as (vec(X_pred)^T K_reg vec(X_pred)) / (vec(X_gt)^T K_reg vec(X_gt)). The score of the ground truth is constantly 1.

In the above formula, it is worth noting that to keep the formula quadratic, we introduce a quadratic function g(n) = an^2 + bn + c to approximate the term 1 - f(n), i.e. g(n) ≈ 1 - f(n), using least-squares fitting to determine the unknown coefficients a, b, c. Technically, the fitting involves sampling n in a certain range according to the prior of the matching problem at hand, e.g. n ∈ {7, 8, 9, 10, 11, 12}. Then, with K_A = 1 (the all-one matrix) and K_B = I, g(n_x) can be expanded as:

g(n_x) = a Σ_{ijkl} X(i, j) · X(k, l) + b Σ_{ij} X(i, j) · X(i, j) + c
       = a · vec(X)^T K_A vec(X) + b · vec(X)^T K_B vec(X) + c
       = vec(X)^T (a K_A + b K_B) vec(X) + c   (6)

During the iterative problem solving in RGM, we note that J changes relatively slowly between two consecutive iterations, and thus we treat it as a constant within an iteration. Based on this observation, we approximately convert J_reg into a QAP formulation (dropping the constant term):

J_reg(X) ≈ vec(X)^T (K - aJ · K_A - bJ · K_B) vec(X)

which is friendly to GNN learning. Let K_reg = K - aJ · K_A - bJ · K_B, and Eq. 5 becomes:

X* = arg max_X vec(X)^T K vec(X) · f(||vec(X)||_1) ≈ arg max_X vec(X)^T K_reg vec(X)

In this way, we can input the regularized affinity matrix K_reg to the GNN to better capture the impact of the regularization term f(||vec(X)||_1). Fig. 3 shows examples of the regularized affinity matrix. Notably, some values in the regularized affinity matrix K_reg become negative.
Intuitively, the negative elements in the affinity matrix mean that maximizing the affinity score no longer rewards matching as many keypoint pairs as possible, which prevents the agent from picking up outliers to some extent. Besides, most traditional graph matching solvers (Cho et al., 2010; Egozi et al., 2013) assume the affinity matrix is non-negative, while our RGM has no such restriction. Fig. 4 illustrates the effectiveness of the regularized affinity score with the three regularization functions {f_i(n)}_{i=1,2,3}. We construct a set of permutation solutions by a GM solver, and calculate the F1 score along with the affinity scores under the original and regularized affinity matrices. As one can see, the affinity score under the original affinity matrix fluctuates considerably as the F1 score increases, and in some cases the original affinity is even larger than that of the ground-truth matching. In contrast, the affinity score under the regularized affinity matrix is more stable and overall nearly positively correlated with the F1 score. This shows that the regularized affinity score is a better optimization objective, consistent with matching accuracy (F1 score). The approximate optimization process can be merged into our revocable framework as shown in Alg. 1: in the sequential decision making process, at every step when the agent needs to select the next vertex, we perform the approximation based on the current solution to obtain the regularized affinity matrix K_reg, and feed K_reg instead of the original affinity matrix K to the GNN. This regularization technique can be readily adopted in our sequential node matching scheme; in contrast, it is nontrivial to integrate it with existing deep GM methods, which mainly perform the whole matching in one shot.
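The per-step construction of K_reg can be sketched as follows (a NumPy illustration; the function name, the toy matrix, and the sampling range are ours, and the constant c is dropped as in the derivation):

```python
import numpy as np

def regularized_affinity(K, J, f, n_range):
    """Sketch of the quadratic approximation: K_reg = K - a*J*K_A - b*J*K_B.

    g(n) = a*n^2 + b*n + c is a least-squares quadratic fit to 1 - f(n)
    over a prior range of match counts n.  With K_A the all-one matrix and
    K_B the identity, vec(X)^T K_A vec(X) = n^2 and vec(X)^T K_B vec(X) = n
    for a partial permutation X with n matches.  J is the current affinity
    score, treated as a constant between consecutive steps.
    """
    ns = np.asarray(list(n_range), dtype=float)
    target = 1.0 - np.array([f(v) for v in ns])
    a, b, _c = np.polyfit(ns, target, 2)   # least-squares fit of g(n)
    N = K.shape[0]
    return K - a * J * np.ones((N, N)) - b * J * np.eye(N)

# Example with f_2(n) = (1 + n) / (1 + 3n), a toy 4x4 affinity matrix,
# and a sampled range n in 1..6 (all values here are illustrative):
f2 = lambda n: (1 + n) / (1 + 3 * n)
K_reg = regularized_affinity(np.eye(4), J=1.0, f=f2, n_range=range(1, 7))
```

In the full pipeline this would be recomputed at every decision step with the up-to-date score J of the current partial solution.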

5. EXPERIMENTS

We perform experiments on standard benchmarks including image datasets (synthetic images, Willow Object, Pascal VOC) as well as pure combinatorial optimization instances (QAPLIB), as the latter is suited for back-end solvers. Due to the page limit, the detailed experiment protocols are introduced in the appendix (A.3), including hyperparameter settings (A.3.1), evaluation metrics (A.3.2), compared methods (A.3.3), dataset settings (A.3.4), and the train/test protocols (A.3.5).

5.1. EXPERIMENTS ON WILLOW OBJECT DATASET WITH OUTLIERS

1) Performance over Different Numbers of Outliers. We conduct experiments with respect to different numbers of outliers: the number of inliers is fixed at 10 while the number of outliers varies from 1 to 6. The experiment setup follows Jiang et al. (2021), and the evaluation is performed on all five categories. For training and testing, we follow the data split of BBGM (Rolínek et al., 2020) in all experiments. Table 1 shows the performance of our method and the baselines, averaged over all five classes of the Willow Object dataset, with respect to different numbers of outliers (from 1 to 6). For our methods at the bottom, "RGM + AR" denotes RGM with affinity regularization, "RGM + IC" denotes RGM with inlier count information, and "RGM + AR + IC" denotes RGM with both. For ablation, RGM without the revocable mechanism is shown in the last three columns, denoted by "w/o rev.". Regularized affinity and inlier count information are the two ways of handling outliers introduced in Sec. 4.3. RGM reaches the best results over all five classes in this dataset, with a 4% improvement on average compared to the best baseline BBGM. Among the variants of our method, "RGM + AR + IC" shows the best performance, while "RGM + AR" requires no extra information and still outperforms all baselines.

2) Sensitivity to the Input Inlier Count. Since in reality the inlier count information might not always be accurate, we test our method's robustness against an inexact number of inliers. We rerun experiments in which each image contains 10 inliers and 3 outliers, and modify the inlier count n_i fed to "RGM + IC" and "RGM + AR + IC" to values from 8 to 13 instead of the ground truth 10. The results in Table 2 show that the performance of these two methods degrades given incorrect information, as expected, but they still outperform BBGM when the information is only slightly biased. Meanwhile, the regularized affinity matrix also enhances the robustness. As a result, our RGM can still work well even when the inlier count information is inaccurate.

5.2. EXPERIMENTS ON PASCAL VOC

Table 3 reports the results on Pascal VOC, where we apply the train/test split described in Sec. A.3.5 and report the average performance over all 20 classes in this dataset; RGM outperforms all baselines on Pascal VOC. Specifically, the baselines on the left side are solvers that do not require label supervision. We can see that our RGM reaches the highest objective score of 1.040, which is clearly greater than 1. However, even with this high objective score, the matching accuracy of RGM is still unsatisfactory. We then conduct experiments with the methods on the right side, which require labels as supervision, meaning their back-end solvers are trained to reach higher accuracy instead of a higher objective score. We combine our RGM with the SOTA methods NGM and NGM-v2 by using their well-trained models as guidance for RGM via imitation learning. We find that RGM can boost the performance of the original methods under this modified objective, which expands the usage of our RGM.

5.3. EXPERIMENTS ON QAPLIB DATASET

For QAPLIB (Burkard et al., 1997), we use exactly the same settings as NGM. The results are shown in Table 4, where "esc(16-64)" denotes that the size of class "esc" varies from 16 to 64. The train-test split for this dataset is 1:1. We calculate the gap between the computed solution and the optimum, and report the average optimality gap (the lower the better). Besides, the inference time per instance is listed in the last column. We compare with four existing solvers: SM (Leordeanu & Hebert, 2005), Sinkhorn-JA (Kushinsky et al., 2019), RRWM (Cho et al., 2010) and NGM (Wang et al., 2021a). Note that NGM is the first solver utilizing deep learning for QAP, which is an emerging topic. RGM outperforms all the baselines, including the latest solver NGM.

6. CONCLUSION

We have presented a deep RL based approach for graph matching, especially in the presence of outliers. The sequential decision scheme allows the agent to naturally select inliers for matching and avoid matching outliers. To our best knowledge, it is the first work applying RL to graph matching in its general QAP form. We further devise two techniques to improve robustness. The first is the revocable action mechanism, which is shown to be well suited to our complex constrained search procedure. The other is the affinity regularization based on parametric function fitting, which is shown to effectively refrain the agent from matching outliers when the number of inliers is unknown. Experiments on multiple real-world datasets show the cost-effectiveness of RGM. Remark: despite the benefits of being label-free and generating node correspondences in a timely manner, our RL module cannot be combined with front-end models for joint feature and solver learning, as mentioned in Sec. 1: under the RL framework, at least with our presented reward based on the affinity objective score, it is impossible to jointly train the front-end models with the back-end solver. This is because the reward itself is a function of the model parameters of the front-end appearance and structure models, e.g. CNN/GNN. In the supervised NGM (Wang et al., 2021a), in contrast, the ground-truth node correspondences used for the loss are computationally irrelevant to the front-end modules, so joint learning is feasible. Only NGM (Wang et al., 2021a)/LCS (Wang et al., 2020c) (these concurrent works are essentially similar at the core idea) and RGM can be directly applied to QAP given an input affinity matrix; see the QAPLIB results in Sec. 5.3.

A APPENDIX

A.1 DISCUSSION WITH EXISTING WORK

Instead of the Lawler's QAP, several GM works choose the Koopmans-Beckmann's QAP (Koopmans & Beckmann, 1957), which requires the explicit input of two graphs. We argue that such raw information may not always be available in practice, e.g. for privacy reasons. For graph matching, the most general form is Lawler's QAP (Lawler, 1963), whose input is the pairwise affinity matrix whereby the raw graph information is removed; the Koopmans-Beckmann's QAP is a special case of Lawler's QAP. There are also long-standing and widely adopted public benchmarks for Lawler's QAP, e.g. QAPLIB (Burkard et al., 1997). For its generality and popularity, SOTA deep GM works (Wang et al., 2020c; 2021a; Rolínek et al., 2020) follow this line, whereby a GNN model is applied on the so-called association graph whose weighted adjacency matrix is the affinity matrix. This GNN model is trained for node embedding on the association graph, and selects the node correspondence via supervised learning in one shot. Since Lawler's QAP only requires the affinity matrix and keeps the user's node and edge information unknown, privacy can be better retained than by explicitly using the input graphs as done in the Koopmans-Beckmann's QAP. Therefore, Lawler's QAP has been adapted to several privacy-sensitive tasks, including model fusion (Liu et al., 2022), bioinformatics (Zaslavskiy et al., 2009), and text alignment (Fey et al., 2020b). As mentioned in the paper, utilizing label-free revocable RL with affinity regularization for designing a new back-end solver becomes a promising tool for pushing the frontier of graph matching research. However, we emphasize that our method is focused on the back-end part whose input is the affinity matrix; it cannot be combined with learnable CNN and GNN for input graph feature extraction and metric learning for jointly differentiable front-back-end learning.
The reason is that, as will be shown in our approach, the RL reward is parameterized by the front-end CNN/GNN/MLP, which makes end-to-end training impossible. For the above reason, our RL solver is trained on the input affinity matrix with the parameters of the front-end models fixed; these can be pretrained via existing supervised methods (Wang et al., 2019b). Table 5 compares existing works regarding their learning modules and techniques. This protocol is akin to the QAP learning part in (Wang et al., 2021a), and can also be regarded as an inherent limitation of RL. Fortunately, as will be shown in our experiments, our method still outperforms end-to-end supervised methods, especially with a large ratio of outliers. Our two-stage training pipeline is also more efficient than joint learning, and thus suitable for our method as RL is more costly than supervised learning.

(Algorithm 1 fragment: state embedding and network updates)
    if Affinity Regularization then
        Input K_reg and s to the GNN in Eq. 9;
    else
        Input K and s to the GNN in Eq. 9;
    Get Q from the Q-value network in Eq. 10;
    # Updating the neural networks:
    cnt ← cnt + 1;
    if cnt % c_1 == 0 then
        Calculate L(f_θ; M) by Eq. 12;
        Update f_θ: θ ← θ − η ∇_θ L(f_θ; M);
        Update the transition priorities in M;
    if cnt % c_2 == 0 then
        Update the target network f_θ⁻: f_θ⁻ ← f_θ;

A.2 DETAILS OF THE PROPOSED METHOD

The idea of RL is to learn from the interactions between the agent and the environment (Sutton & Barto, 2005). The agent's observation of the current environment is called the state s. The agent chooses an action a given the current state according to a specific policy. After the agent performs an action, the environment transfers to another state s'. Meanwhile, the environment returns a reward r to the agent. This pipeline solves the problem progressively. For graph matching, "progressively" means selecting the vertices in the association graph one by one. The environment is defined as a partial solution to the original combinatorial problem (Eq. 1), equivalently on the association graph, where the reward denotes the improvement of the objective function from matching a new pair of nodes. The interactions between the agent and the environment are recorded as transitions (s, a, r, s') in an experience replay memory M. After several episodes, the agent updates its network f_θ according to transitions sampled from M.

A.2.1 NETWORK STRUCTURE

1) State Representation Networks. To better represent the current state on the association graph, we choose graph neural networks (GNN) (Kipf & Welling, 2017) to compute its embedding. GNN extracts vertex features based on their adjacent neighbors. To better use the edge weights in the association graph, we derive from the idea of struct2vec (Dai et al., 2016). Our embedding networks take into account the current solution, node weights, and edge weights of the association graph. The embedding formula is:

E_{t+1} = ReLU(h_1 + h_2 + h_3 + h_4),
h_1 = X · θ_1,   h_2 = (A · E_t) / ((n_1 − 1)(n_2 − 1)) · θ_2,
h_3 = (A · F · θ_3) / ((n_1 − 1)(n_2 − 1)),   h_4 = ReLU(W · θ_5) / ((n_1 − 1)(n_2 − 1)) · θ_4    (9)

where E_t ∈ R^{n_1 n_2 × d} denotes the embedding in the t-th iteration, with d as the hidden size. At every iteration, the embedding is computed from four hidden parts h_1, h_2, h_3, h_4 ∈ R^{n_1 n_2 × d}.
θ_1 ∈ R^d, θ_2 ∈ R^{d×d}, θ_3 ∈ R^d, θ_4 ∈ R^{d×d} and θ_5 ∈ R^d are the weight matrices of the neural networks. t is the iteration index and the total number of iterations is T. We set the initial embedding E_0 = 0 and use ReLU as the activation function. Each hidden part represents one kind of feature: h_1 captures the impact of the current permutation matrix X, which is transformed from the current partial solution U. h_2 takes the neighbors' embeddings into consideration, where A is the adjacency matrix of the association graph and the division by (n_1 − 1)(n_2 − 1) averages over neighbors, since every vertex has (n_1 − 1)(n_2 − 1) neighbors. h_3 averages the neighbors' vertex weights, where F is the vertex weight matrix. h_4 extracts the features of adjacent edges, where W is the edge weight matrix. Please note that the core inputs of our GNN are the permutation matrix X and the affinity matrix K (A, F and W are derived from K).

2) Q-Value Estimation Networks. Q-learning based algorithms use Q(s, a) to represent the value of taking action a in state s, as the expected reward acquired after choosing this action. The agent picks the next action based on the estimated Q(s, a). The Q-value estimation network f_θ takes the embedding of the current state as input and predicts the Q-value of each possible action. We adopt Dueling DQN (Wang et al., 2016) as our approximator to estimate the Q-value function. The architecture of f_θ is:

h_5 = ReLU(E · θ_6 + b_1),   h_v = 1^T (h_5 · θ_7) / (n_1 n_2) + b_2,   h_a = h_5 · θ_8 + b_3,
Q = h_v + h_a − 1^T h_a / (n_1 n_2)    (10)

where E is the final output of the embedding network in Eq. 9. h_5 ∈ R^{n_1 n_2 × d} is the hidden embedding layer, h_v ∈ R^1 is the hidden layer for the state-value function, and h_a ∈ R^{n_1 n_2} is the hidden layer for the advantage function. θ_6 ∈ R^{d×d}, θ_7 ∈ R^d, θ_8 ∈ R^d are the weights of the neural networks.
b_1, b_2, and b_3 are the bias terms. Q ∈ R^{n_1 n_2} is the final output of our Q-value estimation network; it predicts the value of each action given the current state. The state-value function and advantage function separate the value of the state from that of the actions: the state-value function predicts the value of different states, while the advantage function predicts the value of each action given a particular state. Previous work (Wang et al., 2016) shows that this dueling architecture better learns the impact of different actions. Besides, we force the output of the advantage function to sum to 0 by subtracting its mean, which eases the separation of state value and advantage. We use Q(s, a; f_θ) to denote the Q-value estimated by f_θ when the agent takes action a in state s.
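As a concrete illustration, the dueling head of Eq. 10 can be sketched in NumPy as follows. The function and parameter names are ours, and we simplify the biases to scalars; this is a sketch of the computation, not the authors' released implementation.

```python
import numpy as np

def dueling_q(E, theta6, theta7, theta8, b1=0.0, b2=0.0, b3=0.0):
    """Dueling Q-value head following Eq. 10 (illustrative sketch).

    E: (n1*n2, d) final GNN embedding; theta6: (d, d); theta7, theta8: (d,).
    Returns Q: (n1*n2,), one value per candidate action (vertex to match).
    """
    h5 = np.maximum(E @ theta6 + b1, 0.0)        # ReLU hidden layer, (n1*n2, d)
    h_v = (h5 @ theta7).sum() / len(h5) + b2     # scalar state value
    h_a = h5 @ theta8 + b3                       # per-action advantage, (n1*n2,)
    # subtract the mean so the advantages sum to zero (dueling trick)
    return h_v + h_a - h_a.mean()
```

A useful sanity check of the mean-subtraction: adding any constant to the advantage stream leaves Q unchanged, which is exactly what makes the value/advantage decomposition identifiable.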

A.2.2 EXPERIENCE REPLAY MEMORY

For sample efficiency, we maintain a prioritized experience replay memory M (Schaul et al., 2016) that stores the agent's experience, defined as transitions (s_i, a_i, r_i, s'_i) (denoting state, action, reward, and next state respectively). As training progresses, we add new transitions to M and remove old ones. The agent samples from the replay memory to update its neural networks. We follow the idea of prioritized experience replay, which assigns each transition a priority, where higher priority means a higher probability of being sampled:

P(i) = (p_i)^α / Σ_j (p_j)^α

where P(i) is the sampling probability, p_i is the priority of the i-th transition, and α is a hyperparameter. The priority p_i is based on how underfitted the transition is: a larger bias in the agent's Q-value estimation means a relatively higher priority.
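The proportional sampling rule above can be sketched in a few lines of NumPy. This is a minimal illustration of P(i) = p_i^α / Σ_j p_j^α only; a production replay memory would also use a sum-tree for O(log n) sampling and importance-sampling weights, which we omit here.

```python
import numpy as np

def sample_transitions(priorities, batch_size, alpha=0.6, rng=None):
    """Sample transition indices with P(i) = p_i^alpha / sum_j p_j^alpha."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()                      # normalized sampling distribution
    idx = rng.choice(len(p), size=batch_size, p=probs)
    return idx, probs
```

With alpha = 0 this degrades to uniform sampling; larger alpha concentrates updates on transitions with large estimation error.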

A.2.3 MODEL UPDATING

We follow Double DQN (Hasselt et al., 2016) to compute the loss function and update the parameters. We pick the next action a' with the current Q-value estimation network f_θ, but use the target Q-value estimation network f_θ⁻ to evaluate it, as Eq. 12 shows. The motivation of this loss design is that a Q-value overestimated by one network will be mitigated to an extent by the other network.

a' = argmax_{a'} Q(s', a'; f_θ)
L(f_θ; M) = E_{(s,a,r,s') ∼ M} [ (r + γ Q(s', a'; f_θ⁻) − Q(s, a; f_θ))^2 ]    (12)

where L(·) is the loss function to be optimized, f_θ is the current Q-value estimation network and f_θ⁻ is the target Q-value estimation network. s, a, r, s', a' stand for the state, action, reward, next state, and next action. The target network f_θ⁻ shares the same architecture as f_θ, and its parameters are periodically replaced by those of f_θ. Such a target network keeps the target Q-value unchanged for a period of time, which reduces the correlation between the current and target Q-values and improves training stability.

A.2.4 FURTHER DISCUSSION OF THE REVOCABLE ACTION FRAMEWORK

Note that our revocable action mechanism requires the most changes in our environment settings. We therefore design a new RL environment for the revocable action mechanism; the main differences from the original environment are: the available set is gone; the agent can choose any vertex in the vertex set V_a of the association graph G_a; and when the environment receives a vertex that conflicts with the current partial solution, it releases the conflicting vertices and adds the newly arrived vertex to the partial solution. The agent design is independent of whether the basic or the revocable environment is used, and the training process of the revocable framework remains almost the same, as Alg. 1 shows. The revocable flexibility also brings some side effects, e.g.
we cannot adopt acceleration tricks for the GNN, such as dynamic embedding (Wang et al., 2021b), which would otherwise speed up inference. As for the stopping rules of our revocable action mechanism, there are three stopping cases, in which the current best solution is returned: 1) the number of matched nodes equals the given or estimated inlier count; 2) no affinity score improvement is made (the current best solution remains the same) for T_es rounds, where T_es is the hyperparameter for early stopping; 3) the number of rounds reaches the preset maximum T_max. Besides, we add a small reward penalty r_p at each step to keep the agent from repeatedly taking an action and then canceling it. The usage of these parameters is shown in Alg. 1. To the best of our knowledge, there are in general two existing techniques allowing revocable actions, at least for RL-based combinatorial optimization: Local Rewrite (Chen & Tian, 2019) and ECO-DQN (Barrett et al., 2020). Here we discuss our differences from these methods. The local rewrite framework keeps improving a given input solution by exchanging parts of it; however, its performance highly relies on the input solution. In our empirical trials, the efficiency and effectiveness of local rewrite were unsatisfactory, as will be shown in our experiments. The ECO-DQN framework does perform promisingly on the Maximum Cut problem, but it is inherently designed for this specific problem, which has few or no constraints, and is hard to adapt to the graph matching problem. Therefore, in this paper, we devise a new revocable framework, RGM, to meet the characteristics of the graph matching problem, which is friendlier to graphs with relatively more constraints. To verify the effectiveness of our proposed revocable framework, we compare it with the local rewrite framework in Sec. A.4.6.
We believe our revocable action scheme is suited to settings where the graph size is moderate enough to afford such a costly scheme, while the constraints are heavy and complex enough to make revocable actions necessary; graph matching is exactly such a problem setting.
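For concreteness, the Double DQN update of Eq. 12 (Sec. A.2.3) can be sketched as follows. Function and variable names are ours; the sketch computes the regression target with the decoupled action selection/evaluation, and the mean squared TD error.

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, gamma=0.9):
    """Double DQN regression targets, r + gamma * Q(s', a'; f_theta_minus).

    next_q_online / next_q_target: (batch, n_actions) Q-values of s' under
    the online network f_theta and the target network f_theta_minus.
    """
    a_prime = next_q_online.argmax(axis=1)              # select a' with f_theta
    bootstrap = next_q_target[np.arange(len(rewards)), a_prime]  # evaluate with f_theta_minus
    return rewards + gamma * bootstrap

def td_loss(q_sa, targets):
    """Mean squared TD error L(f_theta; M) over the sampled batch."""
    return float(np.mean((targets - q_sa) ** 2))
```

The key point is the decoupling: the arg max runs on the online network while the bootstrap value comes from the frozen target network, mitigating overestimation.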

A.3 EXPERIMENTS PROTOCOL

We perform experiments on various benchmarks including image data as well as pure combinatorial optimization problem instances, the latter being especially suited for our back-end solver. We evaluate the robustness against outliers, where the (ground truth or estimated) number of inliers is given as a hyperparameter. Whether this information is known or not, one can always apply our proposed regularization technique to improve robustness against outliers. The experiments are conducted on a Linux workstation with an NVIDIA 2080Ti GPU and an AMD Ryzen Threadripper 3970X 32-core CPU with 128G RAM. Note that the graph size in our experiments is mostly more than 11, so the number of permutations exceeds 20 million, which means exhaustive search is impossible on commodity computers.

A.3.1 HYPERPARAMETER SETTINGS

For the hyperparameters, we set γ = 0.9 in Eq. 12, the target network update frequency to 40, and the replay size to 100,000. For the ε-greedy exploration, the greedy rate decays from 1.0 to 0.02 over 20,000 episodes. For the learning module, we set the learning rate to 1e-5 and the batch size to 64. The hidden size of our GNN is 128 and the number of layers T is 3. The hidden size of the Q-value network is 64. For the affinity regularization module, the range of the data points used for approximation is S = [n_x − 2, n_x + 2].
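The settings above can be collected in a single configuration object, e.g. as below. The key names are illustrative (ours, not from a released codebase); the values are those stated in the paragraph above.

```python
# Hyperparameters as listed above; key names are illustrative.
RGM_CONFIG = {
    "gamma": 0.9,                  # discount factor in Eq. 12
    "target_update_freq": 40,      # steps between f_theta -> f_theta_minus copies
    "replay_size": 100_000,        # prioritized replay memory capacity
    "epsilon_start": 1.0,          # greedy rate at the first episode
    "epsilon_end": 0.02,           # greedy rate after decay
    "epsilon_decay_episodes": 20_000,
    "lr": 1e-5,                    # learning rate
    "batch_size": 64,
    "gnn_hidden": 128,             # hidden size of the embedding GNN
    "gnn_layers": 3,               # T in Eq. 9
    "q_hidden": 64,                # hidden size of the Q-value network
    "reg_range_offset": 2,         # S = [n_x - 2, n_x + 2] for regularization fitting
}
```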

A.3.2 EVALUATION METRICS

For testing, given the affinity matrix K, RGM predicts a permutation matrix X_pred ∈ {0, 1}^{n_1×n_2} transformed from its solution set U, which is compared against the ground truth X_gt ∈ {0, 1}^{n_1×n_2} (note that the number of nonzero entries in X_gt equals the number of inliers, since the rows and columns of outliers are always zero). Two evaluation metrics are used, the objective affinity score and the F1 score:

Objective score = vec(X_pred)^T K vec(X_pred) / (vec(X_gt)^T K vec(X_gt))    (13)
Recall = Σ (X_pred * X_gt) / Σ X_gt    (14)
Precision = Σ (X_pred * X_gt) / Σ X_pred    (15)
F1 score = 2 · Recall · Precision / (Recall + Precision)    (16)

where * denotes element-wise matrix multiplication. Note that the objective score defined here is agnostic to the presence of outliers, which is a common protocol in existing works. Specifically, for a traditional affinity matrix as used in previous works, the elements are non-negative, and thus the solver generally aims to match as many node correspondences as possible. We also conduct experiments on the well-known QAPLIB dataset. For the problem instances in QAPLIB, the goal is to minimize the objective score. Besides, the ground truth solution is assumed unknown due to the NP-hard nature of the problem. Therefore, we use the gap between the score of the predicted solution and the optimal score provided in the benchmark (which is continuously updated with newly uploaded best solutions) as the metric:

Optimal gap = (vec(X_pred)^T K vec(X_pred) − optimal) / optimal    (17)

Note that in the QAP test there is no outlier issue, and our solver purely optimizes the objective score by matching all the nodes.
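Eqs. 13-16 can be sketched directly in NumPy. The function name is ours, and we assume here that the vectorization order of X matches the one used when K was built (row-major in this sketch); in a real pipeline the two must be kept consistent.

```python
import numpy as np

def matching_metrics(X_pred, X_gt, K):
    """F1 score and normalized objective score for a matching (Eqs. 13-16).

    X_pred, X_gt: (n1, n2) 0/1 matrices; K: (n1*n2, n1*n2) affinity matrix.
    """
    hits = float((X_pred * X_gt).sum())     # element-wise product, Eqs. 14-15
    recall = hits / X_gt.sum()
    precision = hits / X_pred.sum()
    f1 = 2 * recall * precision / (recall + precision) if hits else 0.0
    # vec() assumed row-major; must match the construction of K
    v_pred, v_gt = X_pred.reshape(-1), X_gt.reshape(-1)
    obj = (v_pred @ K @ v_pred) / (v_gt @ K @ v_gt)   # Eq. 13
    return f1, obj
```

By construction, a perfect matching yields f1 = 1 and an objective score of exactly 1; scores above 1 are possible for wrong solutions, which is the inconsistency discussed in Sec. A.4.8.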

A.3.3 COMPARED METHODS

As mentioned before, RGM falls in line with learning-free graph matching back-end solvers that use the affinity matrix K as input, regardless of whether K is obtained by learning-based methods or not. Both traditional and learning-based methods are compared: GAGM (Gold et al., 1996) utilizes the graduated assignment technique with an annealing scheme, iteratively approximating the cost function by Taylor expansion. RRWM (Cho et al., 2010) proposes a random walk view of graph matching with re-weighted jumps. IPFP (Leordeanu et al., 2009) iteratively improves the solution via integer projection, given a continuous or discrete starting solution. PSM (Egozi et al., 2013) improves spectral matching by presenting a probabilistic interpretation of the spectral relaxation scheme. GNCCP (Liu et al., 2012) follows the convex-concave path-following scheme with a simpler form of the partial permutation matrix. BPF (Wang et al., 2018) designs a branch switching technique to seek better paths at singular points, addressing the singular point issue of previous path-following strategies. ZACR (Wang et al., 2020a) suppresses the matching of outliers by assigning zero-valued vectors to potential outliers, and is the latest graph matching solver designated for outliers. In particular, we further compare RGM with current popular deep graph matching methods: GMN (Zanfir et al., 2018), PCA (Wang et al., 2019b), NGM (Wang et al., 2021a), LCS (Wang et al., 2020c) and BBGM (Rolínek et al., 2020), which are state-of-the-art deep graph matching methods; moreover, most of them are open-sourced, which is convenient for a fair comparison.

A.3.4 DATASETS FOR EVALUATION

We briefly describe the used datasets, in line with the recent comprehensive evaluation for deep GM (Wang et al., 2021a). Synthetic Dataset is created with random 2D coordinates as nodes and their distances as edge features for graph matching. Specifically, one graph is randomly constructed, and random noise is added to it to generate the other graph for matching; the ground truth matching is then the identity matrix. Willow Object is collected from real images by Cho et al. (2013). It contains 256 images from 5 categories, each category represented by at least 40 images. All instances in the same class share 10 distinctive image keypoints whose correspondences are manually labeled as ground truth. For testing the performance of handling outliers, we add several random outliers to each image. Pascal VOC (Bourdev & Malik, 2009) consists of 20 classes with keypoint labels on natural images. The instances vary in scale, pose and illumination. The number of keypoints per image ranges from 6 to 23. QAPLIB (Burkard et al., 1997) contains 134 real-world QAP instances from 15 categories, e.g. planning a hospital facility layout or testing self-testable sequential circuits. The problem size is defined as n_1 = n_2 by Lawler's QAP. We use 14 of the 15 categories; the only one left out is "els", as there is only one sample in that category. For the synthetic dataset, we use geometric features to construct the affinity matrix, with a 2:1 train-test split following Jiang et al. (2021). For the natural image datasets Willow Object and Pascal VOC, we use features pretrained by the CNN and GNN from BBGM (Rolínek et al., 2020) via supervised learning on the training set, and use the pre-split train/test sets (8:1) in line with BBGM. We input the learned affinity matrix to RGM and all learning-free methods, to make the comparison with supervised methods as fair as possible.
For QAPLIB, there is no need for front-end feature extractors as the affinity matrix is already given. The train-test split ratio for RGM is 1:1 for each of the selected 14 categories, as we choose the smaller-size half of the instances in each category to train RGM. The peer method NGM (Wang et al., 2021a), due to its model's nature, does not split the train-test set in its QAPLIB experiments and tests directly on the same set used for training. In contrast, our RGM follows the basic protocol in RL and splits train and test sets, making a relatively fair comparison with the baselines.

A.4 ADDITIONAL EXPERIMENTS

A.4.1 EXPERIMENTS ON SYNTHETIC DATASET

We evaluate RGM on synthetic graphs following the protocol of Wang et al. (2021a). The synthetic data test is relatively simple; it mainly shows the effectiveness of our back-end solver when front-end information is limited, as there is no visual data for a CNN to learn from. More outlier tests are given on the real data. We first generate sets of random points in the 2D plane, with coordinates sampled from the uniform distribution U(0, 1) × U(0, 1). First, we select 10 sets of points as the ground truth point sets. Then, we randomly scale their coordinates by a factor drawn from U(1 − δ_s, 1 + δ_s). The set of scaled points and the set of ground truth points form the pairwise graphs to be matched. We set 10 inliers without outliers, and the noise level δ_s varies from 0 to 0.5. For the affinity matrix K, the node affinity is set to 0 and the edge affinity is set by the difference of edge lengths: K_{ia,jb} = exp(−(f_ij − f_ab)² / σ_1), where f_ij is the length of edge E_ij. We generate 300 sets of scaled points for each ground truth set, obtaining 3,000 pairwise graphs. We split the data into training and testing sets at a ratio of 2:1. The results on the synthetic dataset are shown in Fig. 5.
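The edge-affinity construction K_{ia,jb} = exp(−(f_ij − f_ab)² / σ_1) above can be sketched as follows. We assume fully connected point sets and fold the zero node affinity into the diagonal; the function name and σ_1 default are ours, for illustration only.

```python
import numpy as np

def edge_affinity_matrix(P1, P2, sigma1=0.1):
    """Build K with K[ia, jb] = exp(-(f_ij - f_ab)^2 / sigma1) (sketch).

    P1: (n1, 2), P2: (n2, 2) point coordinates; f_ij is the edge length
    between points i, j. Row index is i*n2 + a, column index is j*n2 + b.
    """
    def lengths(P):
        # pairwise Euclidean distances, shape (n, n)
        return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    f1, f2 = lengths(P1), lengths(P2)
    n1, n2 = len(P1), len(P2)
    # diff[i, a, j, b] = f1[i, j] - f2[a, b] via broadcasting
    diff = f1[:, None, :, None] - f2[None, :, None, :]
    K = np.exp(-diff ** 2 / sigma1).reshape(n1 * n2, n1 * n2)
    np.fill_diagonal(K, 0.0)   # node affinities are set to 0
    return K
```

Since edge lengths are symmetric, the resulting K is symmetric with entries in [0, 1], matching the non-negative affinity assumption used by the objective score.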
Evaluations are performed in terms of the noise level δ_s. RGM performs the best in terms of matching F1 score and objective score in all experiments.

A.4.3 GENERALIZATION

1) Generalization to different numbers of outliers. We carry out additional experiments to test the generalization ability across different numbers of outliers: we train RGM on one number of outliers and test it on another. We conduct these experiments on Willow Object with 10 inliers and a range of outliers from 0 to 6. The results are shown in Fig. 7, where for every testing case (column) darker red means better performance. Our RGM generalizes well to different numbers of outliers. Relatively speaking, training RGM with 2, 3, or 4 outliers reaches better generalization, with 3 outliers being the best. 2) Generalization among similar categories. Fig. 8(a) shows the generalization ability of RGM among similar categories: we use one class for training and another for testing. For every testing class (column), darker red is better. RGM generalizes well to different classes. 3) Generalization on QAPLIB. Fig. 8(b) shows the generalization ability of RGM trained on one class and tested on another. For every testing class (column), darker red means a lower optimal gap and hence better performance. RGM generalizes soundly to unseen instances with different problem sizes.

A.4.4 HYPERPARAMETER SENSITIVITY STUDY

We conduct this study on Willow Object with the same setting as the aforementioned experiments, with 10 inliers and 3 outliers in each image. We examine six hyperparameters: batch size, γ, experience replay size, hidden size of the GNN, hidden size of the Q-value network, and the regularization function for affinity regularization with three function forms. The results in Fig. 9 show that RGM is not sensitive to the batch size, hidden sizes, or regularization function, since there are only small fluctuations in F1 score. Here γ is the hyperparameter in Eq. 12 balancing the current reward against the long-term reward, whose value should be neither too high nor too low. RGM performs badly when the replay size is too small for experience replay.

A.4.5 CASE STUDY: SEEDED GRAPH MATCHING

As aforementioned, the core of RGM is to learn the back-end decision making process for GM. Moreover, RGM can flexibly utilize additional information, e.g. initial seeds, meaning that one or several nodes in each of the original pairwise graphs are already matched by humans or other information sources. In implementation, we only need to set the initial seeds as the first several actions, and then let RGM execute normally. We conduct this case study on both the synthetic dataset and the Willow Object dataset, in which each image contains 10 inliers and 3 outliers. Fig. 10 shows that adding suitable initial seeds does improve the performance, which can be useful when part of the matching can be done by manual annotation in the beginning.

A.4.6 ABLATION STUDY: REVOCABLE ACTION V.S. LOCAL REWRITE

We study the effectiveness of our revocable action scheme by comparing it with local rewrite (LR (Chen & Tian, 2019)) under the same RL framework. LR is an influential mechanism in RL-based combinatorial optimization: it tries to improve a given solution instead of generating one from scratch. To some extent, LR can also reverse applied actions via its rewrite mechanism, and it is recognized as a state-of-the-art technique for improving RL. We conduct comparative experiments on QAPLIB: we use LR to improve the solutions given by the baselines and by RGM without the revocable scheme (RGM w/o rev.), and compare the results with RGM. Table 7 shows that our revocable framework RGM still performs the best compared to all boosted baselines. LR does improve the original solutions, but its performance and efficiency remain below our revocable framework.

A.4.7 ABLATION STUDY: USING ALTERNATIVE RL BACKBONES

In RGM, we adopt Double Dueling DQN (D3QN) with prioritized experience replay as our backbone, for which we perform an ablation study against alternatives on Willow Object with 10 inliers and 3 outliers. Fig. 11 shows the mean F1 score over five classes.
We compare D3QN with popular backbones: A2C (Mnih et al., 2016), ACER (Wang et al., 2017), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), the original DQN (Mnih et al., 2013), and Double DQN (Hasselt et al., 2016). D3QN outperforms all other algorithms on Willow Object. We believe the main reason is that graph matching is a discrete decision making problem, where value based methods (DQN, DDQN, D3QN) can be more suitable than policy based methods, which is widely accepted (Sutton & Barto, 2005; Ivanov & D'yakonov, 2019; Sutton & Barto, 2018).

A.4.8 INCONSISTENCY BETWEEN AFFINITY OBJECTIVE AND F1

Finally, we discuss a standing issue in graph matching, and possibly in other optimization tasks, regardless of the presence of outliers: the matching accuracy or F1 score is inconsistent with the value of the objective function. In our analysis, the reason is probably that the objective function cannot perfectly model the ultimate goal, due to limited modeling capacity, noise, etc. For example, as shown in Table 8, when applying our RGM and RRWM (any other method would do, from the perspective of sampling solutions of different quality) on the real image dataset Pascal VOC, the resulting quantile statistics of the objective score deliver an important message: 39.4% of the solutions sampled by RGM achieve an objective score higher than one, which means these wrong solutions get a higher score than the ground truth matching. Note that the front-end CNN/GNN features and the affinity metric model are learned by the state-of-the-art supervised model BBGM; still, the learned affinity function does not perfectly fit the F1 score. In fact, this mismatch relates to front-end affinity learning and outlier dismissal, which is beyond the scope of the back-end solver focused on in this paper.
As Table 8 shows, though 39.4% of the instances have affinity scores larger than 1, among the other 60.6% our RGM solves three quarters perfectly. Compared to RRWM, our RGM solves more instances with affinity scores equal or close to one, which is why RGM outperforms RRWM and the other baselines. Despite the impressive results achieved in our paper, this also suggests a limitation of RL-based solvers, which pursue a high objective score that can sometimes be biased. One possible mitigation is to involve multiple graphs (Yan et al., 2016; Jiang et al., 2021), which we leave for future work.



Footnotes: For clarification, this paper uses "node" for the raw graphs and "vertex" for the association graph. ZACR is an exception, which will be explained in detail in Sec. A.4.2.



Figure 3: Examples of the original affinity matrix and the regularized affinity matrix: (a) original affinity matrix K; (b)(c)(d) regularized affinity matrix K reg by quadratic approximation, with three different regularization functions, of which some values become negative.

Here we list an index of the experiments in the appendix:
(A.4.1) Experiments on the synthetic dataset: evaluating RGM on synthetic images.
(A.4.2) Experiments on the different categories of Willow Object: results over five categories of the Willow Object dataset.
(A.4.3) Generalization test of RGM: generalization to different numbers of outliers, similar categories, and different classes of QAP instances.
(A.4.4) Hyperparameter sensitivity study: testing the sensitivity of the hyperparameters in RGM.
(A.4.5) Seeded graph matching: graph matching with several initial seeds.
(A.4.6) Revocable action v.s. local rewrite: comparison of RGM and the popular Local Rewrite framework (Chen & Tian, 2019).
(A.4.7) Alternative RL backbones: testing alternative RL backbones instead of D3QN in RGM.
(A.4.8) Limitation analysis of the inconsistency: analysis of the inconsistency between the affinity score and the F1 score in the aforementioned experiments.

(Algorithm 1 fragment: action selection and reward calculation)
Choosing the next action:
    if Revocable Mechanism then
        Set the available vertex set V to all vertices V_a of the association graph G_a;
    else
        Calculate the available vertex set V by Eq. 3;
    With probability ε select a random action a ∈ V, otherwise select a = arg max_{a∈V} Q(s, a; f_θ);
    if Affinity Regularization then
        r = J(s') · f(|s'|) − J(s) · f(|s|);
    else
        r = J(s') − J(s);
    if Revocable Mechanism then
        r = r − r_p;
    Store the transition (s, a, r, s') in M;

(Algorithm 1 fragment: stopping rule)
    if (Inlier Count given and |s| == n_i) or (current best solution unchanged in T_es rounds) then
        break;

Figure 5: Performance comparison w.r.t F1 score and objective score (the higher the better) by increasing noise level on the synthetic dataset.

Figure 6: Visual illustration of the matching results by RGM on the Willow Object dataset with 10 inliers (green), and 3 outliers (blue) which are randomly extracted from the images. Green and red lines represent correct and incorrect node matchings respectively. The correct solution is supposed to match all green inliers with green line. The subtitle of each figure shows the correct / incorrect matching count out of the 10 inliers.

Figure 7: Generalization test for the number of outliers by F1 score (↑). Row and column indices denote the number of outliers in the training and testing set, respectively. The average F1 over all five classes of Willow Object is reported. For each testing set (column), the darker red the better.

Figure 8: Generalization test w.r.t (a) F1 score (↑), (b) optimal gap (↓). Row and column indices denote training and testing classes, respectively. For every testing class (column), the darker red the better.

Figure 11: Average F1 score (↑) of different RL algorithms, on five classes of Willow Object with 10 inliers and 3 outliers in both sides for matching.

Average performance w.r.t F1 score and objective score (the higher the better) on the Willow Object dataset with respect to different numbers of randomly added outliers, given the 10 inliers. "AR": affinity regularization; "IC": inlier count information; "w/o rev.": without the revocable mechanism.

Sensitivity test using an inexact inlier count n_i ranging from 8 to 13, instead of the ground-truth 10, on Willow Object. Experiments are conducted with 10 inliers and 3 outliers.

Average performance over all classes on Pascal VOC. "w/ label" denotes requiring a label.

Performance gap from the optimal (%) (the lower the better) on QAPLIB; the mean/max/min gaps are reported for each class. The mean performance over all classes and the inference time (s) per instance are also reported. The number in the bracket is the size of the instances in each category. The methods on the left side are learning-free methods, which use the same input as RGM, thus leading to a fair comparison; in other words, the left-side methods are pure back-end solvers that pursue the highest objective score.
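For reference, the per-instance gap reported here is the usual relative deviation from the known QAPLIB optimum (a sketch with hypothetical names; QAPLIB instances are minimization problems):

```python
def optimal_gap_percent(obj, opt):
    # Relative gap (%) of a solver's objective to the best known optimum
    # of a minimization instance; 0 means the optimum is attained.
    return 100.0 * (obj - opt) / opt

def gap_summary(gaps):
    # The mean/max/min gaps, as reported per class in the table.
    return sum(gaps) / len(gaps), max(gaps), min(gaps)
```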

Representative deep GM works. "KB's" means Koopmans-Beckmann's QAP, which is a special form of Lawler's QAP. The appearance feature and the structure feature are often modeled by a CNN and a GNN, respectively. The affinity model is often relatively simple, e.g. a Gaussian kernel or an MLP.

Algorithm 1: Training algorithm for RGM. It consists of revocable action (Sec. 4.2), inlier count information (Sec. 4.3.1), and affinity regularization (Sec. 4.3.2).
Input: Dataset D; step size η; exploration rate ε; updating frequencies c_1, c_2; inlier count n_i; max round T_max; early stop round T_es; reward penalty r_p.
Output: Well-trained Q-value network f_θ.
Construct the association graph G_a from G_1, G_2, and get its affinity matrix K;
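Algorithm 1 follows a standard experience-replay Q-learning loop (the paper's backbone is D3QN): the replay memory M stores the transitions pushed at each step and is sampled uniformly for minibatch updates. A schematic buffer, with hypothetical names, could look like:

```python
import collections
import random

Transition = collections.namedtuple("Transition", ["s", "a", "r", "s_next"])

class ReplayMemory:
    # FIFO buffer M of transitions; the oldest transitions are evicted
    # once capacity is reached, and minibatches are sampled uniformly.
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append(Transition(s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)
```

In this reading, every c_1 steps a minibatch is sampled to update f_θ by gradient descent with step size η, and every c_2 steps the target network is synchronized, as in standard (double/dueling) DQN training.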





Comparison with the local rewrite (LR) (Chen & Tian, 2019) boosting technique on the QAPLIB dataset w.r.t the optimal gap (the lower the better) and inference time in seconds, where "+LR" denotes applying local rewrite to the output of the original methods, and "RGM w/o rev." denotes our method without the revocable mechanism described in Sec. 4.2.

Quantiles of the objective score and F1 score of the solutions found by RGM and RRWM on the bus category of Pascal VOC without outliers. The discrepancy between F1 score and objective score is clear.

Funding

* Corresponding author is Junchi Yan. The work was in part supported by National Key Research and Development Program of China (2020AAA0107600), National Natural Science Foundation of China (62222607), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and Huawei Technologies.

Appendix

Table 6: Performance comparison w.r.t F1 score and objective score (the higher the better) on the Willow Object dataset, where "F1" and "Obj" are short for F1 score and objective score. All images contain 10 inliers and 3 randomly generated outliers in both graphs. For our RGM, "RGM + AR" means RGM with affinity regularization, "RGM + IC" means RGM with inlier count information, and "RGM + AR + IC" means RGM with both. We add the ablation study of RGM without the revocable mechanism in the last three columns of the table.

Table 6 shows the results over the five categories given three outliers. We compare our methods (bottom box) with the learning-free back-end solvers (top box) and the learning-based deep graph matching methods (middle box). We use the learned features extracted by BBGM (Rolínek et al., 2020) to construct the affinity matrix, which is used as input for all learning-free back-end solvers and our methods. The other learning-based baselines are trained and tested with their own pipelines directly; they do not report an objective score because learning-based methods only care about accuracy or F1 score. Since several outliers should not be matched, the ground-truth matching covers only part of the input keypoints (10, in fact). Therefore, we use the F1 score instead of accuracy to measure performance.

One may attribute the advantage of "RGM + IC" and "RGM + AR + IC" to the use of extra information, i.e. the inlier count, which is a unique ability of our RL-based model compared to peer baselines. Yet "RGM + AR" does not require the inlier count information as input, and can still outperform all baselines in almost all settings.

We find a surprising result: BPF (Wang et al., 2018) reaches the best performance rather than ZACR (Wang et al., 2020a), the latest GM solver tailored for handling outliers.
Per discussion with the authors of ZACR, this is perhaps mainly due to a few strong assumptions they made, which are more suitable to the 50 images (30 cars and 20 motorbikes) used for the experiments in their paper and may not always hold in other datasets including Willow Object. For example, ZACR requires that edges linking two inliers have clearly higher similarities than edges linking inlier-outlier or outlier-outlier pairs. Besides, by its inherent design, ZACR solves Koopmans-Beckmann's QAP instead of Lawler's QAP, and therefore has some difficulty utilizing the affinity matrix obtained by pretraining BBGM. Through communication and discussion with the authors of ZACR, we have tried to modify their code, including using the learned node and edge features in line with their model's interface. Regrettably, the results in Table 6 and Table 1 are the best we can obtain.

The matching visualization is given in Fig. 6. We visualize the matching results of RGM on all five categories, painting the inliers as green nodes and the outliers as blue nodes. In each pair of images, the green and red lines represent correct and incorrect predictions, respectively. Since the goal is to match all inliers and ignore the outliers, we can see that RGM barely matches the blue outliers and focuses on the green inliers.
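Because the ground truth covers only the inlier pairs, accuracy over all predicted pairs is ill-defined in the presence of outliers; the F1 score used throughout can be computed as below (a sketch with hypothetical names; pairs are (node-in-G1, node-in-G2) tuples):

```python
def matching_f1(predicted_pairs, gt_inlier_pairs):
    # Precision counts predicted pairs that are true inlier matches
    # (pairs touching an outlier have no ground truth, so they hurt
    # precision); recall counts how many inlier matches were recovered.
    pred, gt = set(predicted_pairs), set(gt_inlier_pairs)
    if not pred or not gt:
        return 0.0
    tp = len(pred & gt)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)
```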

