KEYPOINT MATCHING VIA RANDOM NETWORK CONSENSUS

Abstract

Visual description, detection, and matching of keypoints in images are fundamental components of many computer vision problems, such as camera tracking and (re)localization. Recently, learning-based feature extractors built on convolutional neural networks (CNNs) have achieved state-of-the-art performance. In this paper, we further explore the usage of CNNs and show that it is possible to leverage randomly initialized CNNs without any training. Our observation is that the CNN architecture inherently extracts features with a certain extent of robustness to viewpoint/illumination changes and can therefore be regarded as a descriptor extractor. Consequently, randomized CNNs serve as descriptor extractors, and a subsequent consensus mechanism detects keypoints from them. This description-and-detection pipeline can be used to match keypoints across images and, in our experiments, generalizes better than state-of-the-art methods.

1. INTRODUCTION

Keypoint detection, description, and matching in images are fundamental building blocks of many computer vision tasks, such as visual localization (Sattler et al., 2016; Taira et al., 2018; Dusmanu et al., 2019; Revaud et al., 2019; Sarlin et al., 2019; 2020; Tang et al., 2021), Structure-from-Motion (SfM) (Snavely et al., 2006; Wu, 2013; Cui & Tan, 2015; Schönberger & Frahm, 2016; Lindenberger et al., 2021), Simultaneous Localization and Mapping (SLAM) (Mur-Artal et al., 2015; Mur-Artal & Tardós, 2017; Dai et al., 2017), object detection (Csurka et al., 2004; Yang et al., 2019), and pose estimation (Suwajanakorn et al., 2018; Kundu et al., 2018). Keypoints, in general, are salient pixels that are matched across images to form point-to-point correspondences. They should be discriminative and robust to viewpoint/illumination changes so that they can be matched accurately. Traditional approaches follow a detect-then-describe pipeline that first detects salient pixels (Harris et al., 1988; Lowe, 2004; Mikolajczyk & Schmid, 2004) and then computes local descriptors (Lowe, 1999; Bay et al., 2006; Calonder et al., 2011; Rublee et al., 2011) at those pixels. Typically, the detectors rely on low-level 2D geometric structures such as corners and blobs. To handle scale changes, image pyramids are built and filtered with the Laplacian of Gaussian (LoG), the Difference of Gaussians (DoG), etc. For description, local statistics such as gradients and histograms are computed and used as visual descriptors. Among these, SIFT (Lowe, 1999) and its variant RootSIFT (Arandjelović & Zisserman, 2012) remain popular today due to their generality.
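As a concrete illustration of the traditional detect-then-describe idea, the following is a minimal numpy/scipy sketch of DoG-based keypoint detection: a pixel is kept if it is a local maximum of the Difference-of-Gaussians response and exceeds a contrast threshold. The function name, the single-scale setting, and the threshold values are our illustrative simplifications, not the full SIFT pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_keypoints(image, sigma=1.6, k=1.4142, thresh=0.02):
    """Detect blob-like keypoints as thresholded local maxima of a
    single-scale Difference-of-Gaussians (DoG) response."""
    img = image.astype(np.float64)
    dog = gaussian_filter(img, sigma) - gaussian_filter(img, k * sigma)
    # A pixel qualifies if it equals the maximum of its 3x3 neighbourhood
    # and its absolute response exceeds the contrast threshold.
    local_max = maximum_filter(dog, size=3) == dog
    ys, xs = np.nonzero(local_max & (np.abs(dog) > thresh))
    return list(zip(ys.tolist(), xs.tolist()))

# Toy example: a single bright blob on a dark background is detected
# near its centre.
img = np.zeros((32, 32))
img[14:18, 14:18] = 1.0
kps = dog_keypoints(img)
```

A real detector would search extrema across an entire scale pyramid and refine keypoint locations to sub-pixel accuracy.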
In recent years, learning-based approaches built on convolutional neural networks (CNNs) (Yi et al., 2016; Noh et al., 2017; Ono et al., 2018; Mishkin et al., 2018; DeTone et al., 2018; Dusmanu et al., 2019; Revaud et al., 2019) have achieved promising results, especially under extreme appearance changes, such as between day and night (Zhou et al., 2016) and across seasons (Sattler et al., 2018). Compared with traditional handcrafted approaches, the key advantage of deep learning is the ability to learn robust keypoint representations from large-scale datasets. The aforementioned methods apply either supervised or self-supervised learning to train their networks. After training, the off-the-shelf detectors and descriptors generalize well to several new datasets. Owing to their simplicity and effectiveness, learned features such as SuperPoint (DeTone et al., 2018), D2-Net (Dusmanu et al., 2019), and R2D2 (Revaud et al., 2019) are widely used nowadays. In this paper, we further explore CNNs for keypoint detection, description, and matching, without requiring the networks to be trained. Our observation is that the CNN architecture itself inherently extracts features with a certain extent of robustness to viewpoint/illumination changes. Therefore, the extracted features can be used directly as visual descriptors, as shown in Figure 1. Since no training is required, one can freely obtain a new set of visual descriptors simply by changing the random seed that generates the network parameters. To minimize the number of incorrect matches (e.g., due to similar and ambiguous image regions), we propose a consensus mechanism that considers several randomly generated descriptors simultaneously. This consensus design can be regarded as keypoint matching and, in our experiments, it successfully filters out a large number of wrong matches.
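The idea of treating an untrained network as a descriptor extractor can be sketched in plain numpy: one randomly initialized convolution layer followed by a ReLU yields a per-pixel feature vector, and re-seeding the random generator yields a new, independent descriptor set. The paper operates on full CNN architectures such as SuperPoint's backbone; the single-layer function below and its parameters are our illustrative simplification.

```python
import numpy as np

def random_conv_descriptors(image, num_filters=16, ksize=3, seed=0):
    """Per-pixel descriptors from one randomly initialized convolution
    layer (valid padding), ReLU activation, and L2 normalization."""
    rng = np.random.default_rng(seed)
    filters = rng.normal(0.0, 1.0, size=(num_filters, ksize, ksize))
    H, W = image.shape
    out = np.zeros((H - ksize + 1, W - ksize + 1, num_filters))
    for f in range(num_filters):          # naive convolution for clarity
        for i in range(H - ksize + 1):
            for j in range(W - ksize + 1):
                out[i, j, f] = np.sum(image[i:i+ksize, j:j+ksize] * filters[f])
    out = np.maximum(out, 0.0)            # ReLU non-linearity
    norms = np.linalg.norm(out, axis=-1, keepdims=True) + 1e-8
    return out / norms                    # unit-norm descriptors

img = np.random.default_rng(1).random((10, 10))
desc = random_conv_descriptors(img, num_filters=8, seed=0)  # shape (8, 8, 8)
```

The same seed always reproduces the same descriptor set, while a new seed produces a different but equally usable one, which is what makes generating many descriptor sets "free".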
The final set of matches, consistent with the epipolar geometry, is found by the widely used RANSAC (Fischler & Bolles, 1981) algorithm. We summarize our contributions as follows:
• We show that CNNs can be used for keypoint description, detection, and matching without requiring the networks to be trained. This allows the algorithm to generalize well across multiple modalities (domains).
• Benefiting from our training-free design, we can freely generate multiple descriptors for each keypoint. This allows us to introduce a consensus mechanism that detects robust keypoint matches among the candidates produced by randomized CNNs.
• The proposed pipeline achieves performance similar to state-of-the-art detectors and descriptors while performing better on images from new modalities.
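The consensus mechanism can be sketched as a voting scheme: each random descriptor set proposes mutual-nearest-neighbour matches, and only matches agreed upon by enough sets survive (after which RANSAC would enforce epipolar consistency). The function names and the voting threshold below are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Mutual nearest-neighbour matches between two sets of
    unit-norm descriptors (rows)."""
    sim = desc_a @ desc_b.T               # cosine similarity matrix
    ab = sim.argmax(axis=1)               # best b for each a
    ba = sim.argmax(axis=0)               # best a for each b
    return {(i, j) for i, j in enumerate(ab) if ba[j] == i}

def consensus_matches(descs_a, descs_b, min_votes=2):
    """Keep a candidate match only if at least `min_votes` random
    descriptor sets independently propose it."""
    votes = {}
    for da, db in zip(descs_a, descs_b):
        for m in mutual_nn_matches(da, db):
            votes[m] = votes.get(m, 0) + 1
    return [m for m, v in votes.items() if v >= min_votes]
```

On synthetic descriptors where keypoint i in one image truly corresponds to keypoint i in the other, the surviving matches concentrate on the diagonal, while spurious single-seed matches are voted out.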

2. RELATED WORK

Traditional keypoint detection and description. In general, a good keypoint should be easy to find, and ideally its location should be suitable for computing a visual descriptor. Therefore, early works (Harris et al., 1988; Shi et al., 1994; Lowe, 2004; Mikolajczyk & Schmid, 2004) detect keypoints as various types of edges, corners, blobs, and shapes. In recent decades, detectors and descriptors like SIFT (Lowe, 2004), SURF (Bay et al., 2006), and RootSIFT (Arandjelović & Zisserman, 2012) have been widely used due to their generality. Benefiting from their time efficiency, binary descriptors such as BRIEF (Calonder et al., 2011), BRISK (Leutenegger et al., 2011), and ORB (Rublee et al., 2011) are also popular in many real-time applications, such as the ORB-SLAM series (Mur-Artal et al., 2015; Mur-Artal & Tardós, 2017). The descriptors designed in the aforementioned traditional approaches, in general, are local statistics with a certain extent of invariance.
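The efficiency of the binary descriptors mentioned above comes from their construction: BRIEF-style descriptors are bit strings of random pixel-intensity comparisons, matched by Hamming distance instead of floating-point similarity. The following is a minimal sketch under our own naming; real BRIEF/ORB additionally smooth the patch and (for ORB) steer the sampling pattern by orientation.

```python
import numpy as np

def brief_descriptor(patch, num_bits=64, seed=0):
    """BRIEF-style binary descriptor: each bit is the outcome of an
    intensity comparison between a random pair of pixels."""
    rng = np.random.default_rng(seed)
    h, w = patch.shape
    pairs = rng.integers(0, h * w, size=(num_bits, 2))
    flat = patch.ravel()
    return (flat[pairs[:, 0]] < flat[pairs[:, 1]]).astype(np.uint8)

def hamming(d1, d2):
    """Hamming distance: number of differing bits."""
    return int(np.count_nonzero(d1 != d2))
```

Because matching reduces to XOR and popcount on bit strings, these descriptors are cheap enough for real-time SLAM front-ends.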



Figure 1: Keypoint matching results on two pairs of color images with a certain extent of viewpoint change. On the left, we apply the trained off-the-shelf SuperPoint (DeTone et al., 2018) and visualize its correct matches as green line segments. On the right, we use the same network architecture as SuperPoint but apply three different sets of random parameters. The correct matches produced by the three random CNNs are visualized as blue, yellow, and red line segments, respectively. Different network parameters produce different matches, yet the matches overlap in several regions. The images are from the 7-Scenes dataset (Shotton et al., 2013).

