KEYPOINT MATCHING VIA RANDOM NETWORK CONSENSUS

Abstract

Visual description, detection, and matching of keypoints in images are fundamental components of many computer vision problems, such as camera tracking and (re)localization. Recently, learning-based feature extractors built on convolutional neural networks (CNNs) have achieved state-of-the-art performance. In this paper, we further explore the usage of CNNs and show that it is possible to leverage randomly initialized CNNs without training. Our observation is that the CNN architecture inherently extracts features with a certain degree of robustness to viewpoint/illumination changes and can thus be regarded as a descriptor extractor. Consequently, randomized CNNs serve as descriptor extractors, and a subsequent consensus mechanism detects keypoints using them. Such a description and detection pipeline can be used to match keypoints in images and, in our experiments, achieves higher generalization ability than the state-of-the-art methods.



Figure 1: Keypoint matching results on two pairs of color images with a certain degree of viewpoint change. On the left, we apply the trained off-the-shelf SuperPoint (DeTone et al., 2018) and visualize its correct matches as green line segments. On the right, we use the same network architecture as SuperPoint but apply 3 different sets of random parameters. The correct matches produced by the 3 random CNNs are visualized as blue, yellow, and red line segments, respectively. Different network parameters produce different matches, yet we observe that they overlap in several regions. The images are from the 7-Scenes dataset (Shotton et al., 2013).

The CNN architecture inherently extracts features with a certain degree of robustness to viewpoint/illumination changes. Therefore, the extracted features can be directly used as visual descriptors, as shown in Figure 1. Since no training is required, a set of visual feature descriptors can be obtained at no cost simply by changing the random seed that generates the network parameters. In order to minimize the number of incorrect matches (e.g., due to similar and ambiguous image regions), we propose a consensus mechanism that considers several randomly generated descriptors simultaneously to filter incorrect matches. This consensus design can be regarded as keypoint matching and, in our experiments, it successfully filters out a large number of wrong matches. The final set of matches, consistent with the epipolar geometry, is found by the widely used RANSAC (Fischler & Bolles, 1981) algorithm. We summarize our contributions as follows:

• We show that CNNs can be used for keypoint description, detection, and matching without requiring the deep networks to be trained. This allows the algorithm to generalize well across multiple modalities (domains).
• Benefiting from our no-training design, we can freely generate multiple descriptors for each keypoint. This allows us to introduce a consensus mechanism that detects robust keypoint matches among the candidates produced by randomized CNNs.
• The proposed pipeline achieves performance similar to that of state-of-the-art detectors and descriptors while performing better on images of new modalities.

2. RELATED WORK

Traditional keypoint detection and description. In general, a good keypoint should be easy to find, and ideally its location should be suitable for computing a visual descriptor. Therefore, early works (Harris et al., 1988; Shi et al., 1994; Lowe, 2004; Mikolajczyk & Schmid, 2004) detect keypoints as various types of edges, corners, blobs, shapes, etc. In recent decades, detectors and descriptors such as SIFT (Lowe, 2004), SURF (Bay et al., 2006), and RootSIFT (Arandjelović & Zisserman, 2012) have been widely used due to their generality. Owing to their time efficiency, binary descriptors such as BRIEF (Calonder et al., 2011), BRISK (Leutenegger et al., 2011), and ORB (Rublee et al., 2011) are also popular in many real-time applications, notably the ORB-SLAM series (Mur-Artal et al., 2015; Mur-Artal & Tardós, 2017). The descriptors designed in these traditional approaches are, in general, local statistics with a certain degree of invariance to scale and rotation. In this paper, we show that random statistics stemming from convolutional neural networks (CNNs) can also be used as visual descriptors.

Learning-based keypoint detection and description. FAST (Rosten & Drummond, 2006) is the first approach that introduces machine learning for corner detection. Recent works (Savinov et al., 2017b; Zhang & Rusinkiewicz, 2018; Di Febbo et al., 2018; Laguna & Mikolajczyk, 2022) make use of deep learning with CNNs to boost the performance. Most of the learning-based methods focus on description (Simonyan et al., 2014; Simo-Serra et al., 2015; Balntas et al., 2016; Savinov et al., 2017a; Mishchuk et al., 2017; He et al., 2018; Luo et al., 2019). Based on the traditional detection-then-description pipeline, LIFT (Yi et al., 2016) takes both keypoint detection and description into account. SuperPoint (DeTone et al., 2018) is the first approach to perform both tasks in a single network. One problem with supervised learning of keypoint detectors is how to define saliency. SuperPoint first makes use of a synthetic dataset consisting of different shapes and regards the junctions as keypoints for pre-training. Then homographic adaptation is applied to other datasets (e.g., MS-COCO (Lin et al., 2014)) for self-supervised learning. D2-Net (Dusmanu et al., 2019) proposes to perform detection after description, and an additional loss term is added to seek repeatability. Meanwhile, the keypoints should be not only repeatable but also reliable, which motivates the approach of R2D2 (Revaud et al., 2019). Other recent works (Noh et al., 2017; Ono et al., 2018; Luo et al., 2020; Tyszkiewicz et al., 2020; Li et al., 2022) apply a similar pipeline and contribute network designs and training mechanisms. In this paper, we focus on approaches with simple yet effective network architectures and explore the impact of randomness and the consensus mechanism. Specifically, SuperPoint (DeTone et al., 2018), D2-Net (Dusmanu et al., 2019), and R2D2 (Revaud et al., 2019) are chosen as representatives. There are also learning-based dense or semantic correspondence predictions such as UCN (Choy et al., 2016) and NBB (Aberman et al., 2018), which are beyond our scope.

Consensus mechanism. Robust estimation is the problem of simultaneously estimating the parameters of an unknown mathematical model and finding the points consistent with it (i.e., inliers) in a set of noisy inliers and large-scale measurement errors (i.e., outliers). One of the most popular robust estimators is RANdom SAmple Consensus (RANSAC) (Fischler & Bolles, 1981), which iteratively selects minimal sets of data points, estimates the model parameters, and calculates the support (i.e., number of inliers). There are many variants (Brachmann et al., 2017; Barath & Matas, 2018; Barath et al., 2020; Ivashechkin et al., 2021), and the idea of voting and consensus is widely used in computer vision problems such as visual localization (Brachmann & Rother, 2019; Huang et al., 2021), object detection (Qi et al., 2019), and pose estimation (Peng et al., 2019). In this paper, we apply the idea of voting and use a consensus mechanism to detect robust keypoint matches.

3. METHOD

Problem statement. Given a pair of images containing overlapping scene regions, the task of keypoint matching is to find a set of pixel-wise matches that correspond to the same underlying 3D scene points. These matches enable downstream tasks, e.g., pose estimation and Structure-from-Motion (SfM). Note that camera pose estimation is a minimal problem that requires only a few high-precision matches. In practice, however, because matches have limited precision and outliers are present, a sufficient number of matches is required to run robust estimation.

Method overview. Figure 2 illustrates our proposed method. The input is a pair of images, and the output is a set of pixel-wise matches. We make use of m VGG-style (Simonyan & Zisserman, 2014) convolutional neural networks (CNNs) with random parameters, i.e., there are m different visual descriptor extractors. Therefore, for each pixel in each image, we obtain m descriptors. Next, we apply a matcher (e.g., the nearest neighbor matcher) across images to select similar pixels based on the extracted descriptors. Note that the matching process is executed independently for each extractor. Consequently, we obtain m sets of match candidates. These candidates are then fed into a consensus mechanism to produce the final matches. Below, we first describe how the randomized CNNs generate match candidates in Section 3.1, and then introduce the consensus mechanism that produces the final matches in Section 3.2.

3.1. RANDOM DESCRIPTION

The keypoint extraction process of a single CNN is illustrated in Figure 3. Following SuperPoint (DeTone et al., 2018), we apply a simplified network architecture as the descriptor extractor $f$ that takes the full image $I_{H \times W}$ as input and produces the feature map $F_{H \times W \times N} = f(I_{H \times W})$ as pixel-wise descriptors, where $H, W \in \mathbb{N}$ refer to the image height and width, and $N \in \mathbb{N}$ refers to the dimension of the descriptors. Before the descriptor matcher is applied, a saliency map is computed from the feature map $F$ to filter out homogeneous regions.

Descriptor. In our method, the CNNs are randomly initialized without any training. Our intuition is that a convolution kernel computes a certain type of local statistics inside its receptive field, just like traditional methods that count handcrafted gradients and histograms. Therefore, a CNN is a combination of kernels that computes statistics of statistics. In the literature, there are machine-learning-style algorithms that apply random statistics to solve computer vision problems, such as place recognition (Glocker et al., 2014) and visual localization (Cavallari et al., 2019), which demonstrate the effectiveness of randomized-then-fixed statistics. Note that, in our method, the parameters of the CNNs are also fixed after the random initialization, so that each CNN consistently computes the same type of statistics at inference time. Consequently, a descriptor can only be matched against descriptors extracted by the same CNN. Since we employ multiple CNNs independently, keypoint extraction and matching can be run in parallel.

Saliency. Before matching, we apply a saliency detection process to reduce the matching space. Directly matching the descriptors from a randomized CNN results in many candidates lying on homogeneous regions, such as textureless floors and walls. To filter out these meaningless candidates, we adopt the keypoint detection formulation proposed in D2-Net (Dusmanu et al., 2019). The key idea is to detect local maxima in the high-level visual descriptor space, rather than detecting local 2D patterns in the low-level image color space. Specifically, the detection formulation considers two aspects: a local softmax $\alpha$ among nearby pixels in each feature channel, and a ratio $\beta$ among the feature channels of each pixel. The $\alpha$ and $\beta$ scores are defined as follows:

$$\alpha_{i,j,k} = \frac{\exp(F_{i,j,k})}{\sum_{(i',j') \in \mathcal{N}(i,j)} \exp(F_{i',j',k})}, \qquad \beta_{i,j,k} = \frac{F_{i,j,k}}{\max_t F_{i,j,t}},$$

where $\mathcal{N}(i,j)$ refers to the locations of the neighbor pixels around the pixel at $(i,j)$, including itself. The saliency score is defined as $s_{i,j} = \max_k (\alpha_{i,j,k} \cdot \beta_{i,j,k})$ and is then normalized at the image level. Note that this process is a forward computation without additional parameters. In D2-Net (Dusmanu et al., 2019), the formulation is used to perform soft detection during training. In our experiments, we observe that the process effectively assigns high scores to salient pixels even if the CNN parameters are randomized.

Matching. For keypoint matching, we make use of a classical nearest neighbor matcher: for each descriptor in one image, it retrieves the top-2 most similar descriptors in the other image and applies a ratio test (Lowe, 2004) to filter out ambiguous descriptors. Then, a mutual nearest neighbor check is applied to keep only those matches that are stable in the two matching directions, i.e., from the left to the right image and vice versa. A match candidate is represented as $p = \{(i_1, j_1), (i_2, j_2)\}$, where $i, j \in \mathbb{N}$ refer to the 2D locations of the two associated keypoints. For the descriptors from each CNN $f_i$, the matching process is executed independently to generate a set of match candidates $M_i = \{p_{i1}, p_{i2}, \ldots, p_{ik}\}$. As a result, we obtain m sets of match candidates as the input to the following consensus mechanism.
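The matching step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: descriptors are given as equal-length tuples, similarity is measured by Euclidean distance, and the function names `ratio_test_matches` and `mutual_matches` are hypothetical.

```python
import math


def ratio_test_matches(desc_a, desc_b, ratio=0.95):
    """Map each index in desc_a to its nearest neighbor in desc_b,
    keeping only matches that pass the ratio test."""
    matches = {}
    for ia, da in enumerate(desc_a):
        # Sort candidates in the other image by Euclidean distance.
        dists = sorted((math.dist(da, db), ib) for ib, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio test needs two neighbors
        (d1, best), (d2, _) = dists[0], dists[1]
        # Keep the match only if the best neighbor is clearly closer
        # than the second-best one.
        if d1 < ratio * d2:
            matches[ia] = best
    return matches


def mutual_matches(desc_a, desc_b, ratio=0.95):
    """Keep matches that survive the ratio test in both directions."""
    a_to_b = ratio_test_matches(desc_a, desc_b, ratio)
    b_to_a = ratio_test_matches(desc_b, desc_a, ratio)
    return sorted((ia, ib) for ia, ib in a_to_b.items() if b_to_a.get(ib) == ia)
```

Running `mutual_matches` once per randomized CNN would yield one candidate set per extractor, which is the form of input the consensus mechanism expects.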

3.2. CONSENSUS MECHANISM

Directly passing all the match candidates $M = \bigcup_{i=1}^{m} M_i$ to a downstream task such as pose estimation often fails due to a large proportion of wrong matches. In this section, we introduce a simple and effective consensus mechanism that rejects incorrect matches early. For each keypoint in $M$, our goal is to find a correct match or to discard it. The idea, inspired by RANSAC (Fischler & Bolles, 1981), is to first generate model hypotheses using random minimal samples and then vote for each hypothesis using the rest of the samples to select the one with the strongest consensus. In our problem, a randomly selected candidate $p \in M$ serves as the minimal sample (i.e., a model hypothesis) that gives a match between keypoints $(i_1, j_1)$ and $(i_2, j_2)$. Then the rest of the match candidates correlated with the two keypoints vote on whether they support the hypothesis. This process is achieved by keypoint clustering and consensus scoring, which are introduced in detail below.

Keypoint clustering. Given a match hypothesis $h$ generated from $p_x = \{(i_{x1}, j_{x1}), (i_{x2}, j_{x2})\} \in M$, the objective is to check whether it is in consensus with other correlated match candidates in $M$. The correlated candidates $M_x \subseteq M$ are obtained by collecting all the candidates that are associated with keypoints $(i_{x1}, j_{x1})$ or $(i_{x2}, j_{x2})$. With the keypoints in $M_x$, we apply 2D location clustering in each image separately. According to the clustering results, we compute a consensus score that measures the keypoint distribution. If the keypoints are well distributed, the hypothesis $h$ passes the consensus check and we update the keypoint locations with the center points of the clusters with the strongest consensus. Otherwise, the hypothesis $h$ is discarded. The hypotheses with optimized keypoint locations are output as the final robust keypoint matches.

Consensus scoring. To quantitatively measure the consensus status (keypoint distribution) after clustering, we introduce the consensus score on top of the clusters $Q$. First, the clusters containing only one keypoint are immediately discarded. For each remaining cluster $q \in Q$, we compute a density score defined as $d_q = |q| / \mathrm{std}(q)$, where $|q|$ refers to the number of keypoints, and $\mathrm{std}(q)$ refers to the standard deviation of the 2D locations, which approximates the cluster radius. Finally, the consensus score is defined as

$$c = \begin{cases} d_q & \text{if } |Q| = 1, \\ \max_{q}(d_q) \,/\, \sum_{q \in Q} d_q & \text{otherwise,} \end{cases}$$

where, in the first case, $q$ is the single remaining cluster. Three examples of the consensus status are illustrated in Figure 4, and the set of keypoints in orange obtains the best score among the three.

Generality. The proposed consensus mechanism is agnostic to keypoint descriptors, detectors, and matchers. Therefore, alternatives such as the trained SuperPoint and SuperGlue (Sarlin et al., 2020) can also be ensembled into the framework.
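The consensus score defined in this section can be sketched as follows. This is only an illustration under stated assumptions: clusters are given as lists of 2D points (the clustering itself is not shown), std(q) is taken to be the root-mean-square distance of the points to their centroid, a small `eps` guards against zero spread, and an empty set of valid clusters is assigned a score of 0. The text does not specify these details, and the function names are hypothetical.

```python
import math


def density(cluster, eps=1e-6):
    """Density d_q = |q| / std(q) of a cluster of 2D points.
    Assumption: std(q) is the RMS distance to the cluster centroid."""
    n = len(cluster)
    cx = sum(x for x, _ in cluster) / n
    cy = sum(y for _, y in cluster) / n
    std = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster) / n)
    return n / max(std, eps)  # eps avoids division by zero (assumption)


def consensus_score(clusters):
    """Consensus score c over the clusters of one hypothesis in one image."""
    # Clusters containing a single keypoint are discarded first.
    valid = [q for q in clusters if len(q) > 1]
    if not valid:
        return 0.0  # assumption: no valid cluster means no consensus
    densities = [density(q) for q in valid]
    if len(densities) == 1:
        return densities[0]
    # With several clusters the score is a ratio strictly below 1.
    return max(densities) / sum(densities)
```

With a threshold of 1.0 on this score, a hypothesis can only pass when exactly one valid cluster remains and that cluster is compact enough, since the multi-cluster branch always yields a value below 1.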

4. EXPERIMENTS

In this section, we validate the effectiveness of our method. We first elaborate on the implementation details in Section 4.1. Then, we conduct comparisons with state-of-the-art representative methods on both matching performance in Section 4.2 and pose estimation in Section 4.3. Last, we perform analysis and ablation studies on our method in Section 4.4.

4.1. IMPLEMENTATION DETAILS

In all the experiments, we use a 7-layer convolutional neural network (CNN) as the basic descriptor extractor. Each layer of the network is followed by a ReLU activation, except for the last layer, and each of the first 3 activations is followed by a pooling layer. The final feature maps, which contain N = 256 dimensional descriptors, are normalized to unit vectors before output. We apply bilinear interpolation to resize the descriptors to the input resolution. By default, we employ m = 8 randomized CNNs for visual feature extraction. With the saliency map, we only select the keypoints whose scores are higher than the median for the following matching process. The ratio test threshold during the matching process is set to 0.95. As for the consensus mechanism, we apply the MeanShift algorithm (Pedregosa et al., 2011) with a bandwidth of 24 pixels to perform keypoint clustering. The threshold of the consensus score is set to 1.0 by default, i.e., we output a match only if there is exactly one valid and compact cluster.
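To make the saliency-based selection concrete, the sketch below computes the saliency score of Section 3.1 and keeps the pixels scoring above the median. It is an illustration only, under assumptions the text does not fix: the feature map is a nested list `F[i][j][k]` with non-negative values, the neighborhood is a 3×3 window clipped at the image border, and image-level normalization divides by the sum of all scores. The function names are hypothetical.

```python
import math
from statistics import median


def saliency_map(F, radius=1, eps=1e-8):
    """s[i][j] = max_k(alpha * beta), then normalized over the image."""
    H, W, N = len(F), len(F[0]), len(F[0][0])
    S = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Neighborhood of (i, j), including itself, clipped at borders.
            rows = range(max(0, i - radius), min(H, i + radius + 1))
            cols = range(max(0, j - radius), min(W, j + radius + 1))
            peak = max(max(F[i][j]), eps)  # channel-wise maximum
            best = 0.0
            for k in range(N):
                denom = sum(math.exp(F[a][b][k]) for a in rows for b in cols)
                alpha = math.exp(F[i][j][k]) / denom  # local softmax
                beta = F[i][j][k] / peak              # channel ratio
                best = max(best, alpha * beta)
            S[i][j] = best
    total = sum(map(sum, S))
    if total <= eps:
        return S
    # Assumption: image-level normalization divides by the total score.
    return [[s / total for s in row] for row in S]


def select_salient(S):
    """Return pixel locations whose saliency exceeds the image median."""
    threshold = median(s for row in S for s in row)
    return [(i, j) for i, row in enumerate(S)
            for j, s in enumerate(row) if s > threshold]
```

The explicit loops are kept for readability; an array library would be the natural choice for real image sizes.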

4.2. KEYPOINT MATCHING

Competitors. There are various related approaches built on CNN-based keypoints, and we consider the following methods the most relevant and representative competitors: DELF (Noh et al., 2017), SuperPoint (DeTone et al., 2018), D2-Net (Dusmanu et al., 2019), and R2D2 (Revaud et al., 2019). We also include RootSIFT (Arandjelović & Zisserman, 2012) as the representative of traditional handcrafted approaches.

Datasets. We conduct experiments on the 7-Scenes (Shotton et al., 2013) and MegaDepth (Li & Snavely, 2018) datasets since they provide depth images that can be used to verify matches densely. The 7-Scenes dataset consists of 7 indoor scenes with RGB-D images and ground truth camera poses. Several sequences are officially divided into training and test sets for each scene. Since noisy poses exist in the training set, we only use the test set for our evaluation. To avoid view selection biases, we uniformly sample each test sequence to form our test pairs. We use the original color images, calibrated depth images, and normal images (computed from the depth images) as our test input to evaluate the generalization ability to different modalities (domains). The MegaDepth dataset contains outdoor scenes with RGB-D images; we use a subset with ground truth camera poses from (Tyszkiewicz et al., 2020; Sun et al., 2021) as the test set. Since the MegaDepth evaluation set is part of the training set of D2-Net, we report the results of D2-Net only on the depth and normal images in our evaluation.

Metrics. To quantitatively evaluate the keypoint matching results, we use the matching accuracy (i.e., the proportion of correct matches) and the number of correct matches as our metrics. On the 7-Scenes dataset, a match is considered correct if the distance between the corresponding 3D points is lower than a threshold (1cm and 5cm). On the MegaDepth dataset, due to scale ambiguity, we apply thresholds (5 pixels and 20 pixels) on the reprojection error instead of the absolute 3D distance.

Results. The results are shown in Table 1. On the 7-Scenes dataset, our method obtains more correct matches while its accuracy is comparable with that of the competitors. For depth and normal images, all the methods suffer from performance drops compared with color images, but our method performs best overall. This demonstrates the effectiveness of the randomized CNNs with the consensus mechanism. The MegaDepth dataset is much more challenging than 7-Scenes due to the large viewpoint changes, and we observe that our method obtains reasonable results. Besides the quantitative results, we visualize the keypoint matches of two samples in Figure 5.
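As a concrete reading of the 7-Scenes metric defined above, the following sketch counts a match as correct when its two 3D points lie within a distance threshold. It assumes the matched keypoints have already been back-projected to 3D points in a common world frame, in meters; the function name `matching_accuracy` is hypothetical.

```python
import math


def matching_accuracy(points_a, points_b, threshold_m):
    """Return (accuracy, number of correct matches) for paired 3D points.

    points_a[n] and points_b[n] are the 3D points of the n-th match,
    assumed to be expressed in the same world frame, in meters."""
    correct = sum(
        math.dist(p, q) < threshold_m for p, q in zip(points_a, points_b)
    )
    total = len(points_a)
    accuracy = correct / total if total else 0.0
    return accuracy, correct
```

Calling it with `threshold_m=0.01` and `threshold_m=0.05` corresponds to the 1cm and 5cm settings; the MegaDepth variant would replace the 3D distance with a reprojection error in pixels.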

4.3. POSE ESTIMATION

Solvers and metrics. On the 7-Scenes dataset, we apply the PnP algorithm (Lepetit et al., 2009) with RANSAC (Fischler & Bolles, 1981) to solve absolute poses, and we report median translation and rotation errors. On the MegaDepth dataset, due to scale ambiguity, we solve relative poses from essential matrix estimation (Stewenius et al., 2006) with RANSAC, and we report the area under the recall curve (AUC) as in (Sarlin et al., 2020).

Results. The results are shown in Table 2. On the 7-Scenes dataset, our poses on depth and normal images are marginally worse than those of SuperPoint, although our matching results are better. For the MegaDepth dataset, our results on depth images are close to the state of the art, while the results on color and normal images lag behind. This evaluation reveals limitations of our method, which are discussed in detail in Section 4.4.

Effectiveness of saliency. To evaluate the benefit of the saliency computed on top of the descriptors, we conduct experiments on the 7-Scenes dataset with a single randomized CNN. When we select only the descriptors with saliency scores higher than the median, the matching accuracies reach 70.07% (+3.58%), 34.43% (+11.37%), and 48.73% (+14.42%) for color, depth, and normal images. In Figure 6, we sample 5 images from the 7-Scenes and MegaDepth datasets and visualize the locations of the top 50 salient descriptors in each image. We observe that the visualized descriptors are mostly located around salient regions.

Ablation studies. In Table 3, we report ablation studies of our method using different numbers of CNNs and ensembling the trained SuperPoint and SuperGlue. The result of the single CNN does not apply the consensus check. With the consensus check, when there are 3 CNNs, we get a high accuracy but a low number of correct matches. As the number of CNNs increases, the accuracy drops slightly while we get more correct matches. The 7+SP variant refers to a total of 8 CNNs, one of which is the trained SuperPoint. The SuperGlue matcher is only applied to SuperPoint. Since they account for only 1/8 of the extractors, SuperPoint and SuperGlue have little impact on the results.

Table 3: Ablation studies on the 7-Scenes dataset. We report the matching accuracy and the number of correct matches for each variant of our method.

Limitations. Below we analyze the limitations of our method. From Table 1 we observe that for depth and normal images on the 7-Scenes dataset, our method is better than SuperPoint in both matching accuracy and number of matches under different thresholds. However, our camera pose estimates shown in Table 2 are less precise. We attribute this to the distribution of the matched keypoints, as shown in Figure 7. Therefore, improving the camera pose solver so that it is not misled by well-structured outliers is a promising direction for future research. Another limitation of our method is that there is no guarantee that the descriptors are scale/rotation invariant. As shown in Figure 8, on the HPatches (Balntas et al., 2017) dataset, our method overall underperforms R2D2 and SuperPoint. The main reason is that our method fails on images with very large viewpoint changes.

5. CONCLUSION

In this paper, we present a new approach that uses random statistics extracted by randomized convolutional neural networks (CNNs) as visual descriptors, followed by a consensus mechanism to perform keypoint matching among images. Since the descriptors are not guaranteed to be scale/rotation invariant, incorporating such invariance is expected to improve performance. The choice of network architecture also merits further exploration.



Figure 2: The pipeline of our method. A set of convolutional neural networks (CNNs) with random parameters serve as visual descriptor extractors. The two input images are fed into each CNN to extract pixel-wise features as descriptors. Next, for the descriptors output by each extractor, a matcher is applied to compute match candidates between the two images. The candidates are then fed into a consensus mechanism to select the final matches.

Figure 3: The keypoint extraction (description and saliency estimation) pipeline for a single image with a single randomized CNN. The CNN produces dense feature maps as pixel-wise visual descriptors. Salient descriptors are then used for the keypoint matching process, in which the saliency is estimated by normalization on top of the dense descriptors.

Figure 4: Illustration of our consensus status on match hypotheses. The associated keypoints of each hypothesis are shown in the same color. The keypoints in the two images (left and middle) are separately grouped into clusters represented as center points (right).

Figure 5: Visualization of matching results. On the left, we sample a pair of depth images from the 7-Scenes dataset. On the right, we sample a pair of normal images from the MegaDepth dataset.

Figure 6: Visualization of the salient descriptors in our method using a single randomized CNN. The left three samples are color, depth, and normal images from the 7-Scenes dataset and the right two are depth and normal images from the MegaDepth dataset. The salient locations are shown in red dots in each image.

Figure 7: A failed pose estimation: we visualize inliers from the pose solver containing incorrect matches in blue line segments while the ground truth matches are shown in green line segments.

Figure 8: Results of keypoint matching on the HPatches dataset. As in (Dusmanu et al., 2019), we apply the metric of mean matching accuracy (MMA).

Table 1: Results of keypoint matching on the 7-Scenes and MegaDepth datasets. The best and second-best numbers are labeled red and blue, respectively.

Table 2: Results of camera pose estimation on the 7-Scenes and MegaDepth datasets.

We first validate the sensitivity of the descriptors stemming from randomized CNNs. To do so, we test a single CNN (without the saliency computation) with 10 different random seeds on the 7-Scenes dataset. The mean matching accuracies for color, depth, and normal images are 66.49%, 23.06%, and 34.31%, with standard deviations 2.

Color Image    70.07% / 182   82.70% / 95   79.46% / 165   78.55% / 207   78.57% / 212   78.43% / 213
Depth Image    34.43% / 32    50.00% / 9    45.45% / 22    43.69% / 32    43.80% / 33    43.83% / 33
Normal Image   48.73% / 38    83.33% / 8    75.51% / 21    73.68% / 33    72.73% / 33    72.73% / 35

