LEARNING TO REGISTER UNBALANCED POINT PAIRS

Abstract

Point cloud registration methods can effectively handle large-scale, partially overlapping point cloud pairs. Despite its practicality, matching pairs that are unbalanced in spatial extent and density has been overlooked and rarely studied. We present a novel method, dubbed UPPNet, for Unbalanced Point cloud Pair registration. We propose a hierarchical framework that effectively finds inlier correspondences by gradually reducing the search space. The proposed method first predicts subregions within the target point cloud that are likely to overlap with the query. The subsequent super-point matching and fine-grained refinement modules then predict accurate inlier correspondences between the target and query. Additional geometric constraints are applied to refine the correspondences that satisfy spatial compatibility. The proposed network can be trained in an end-to-end manner, predicting the accurate rigid transformation with a single forward pass. To validate the efficacy of the proposed method, we create a carefully designed benchmark, named the KITTI-UPP dataset, by augmenting the KITTI odometry dataset. Extensive experiments reveal that the proposed method not only outperforms state-of-the-art point cloud registration methods by large margins on the KITTI-UPP benchmark, but also achieves competitive results on standard pairwise registration benchmarks including 3DMatch, 3DLoMatch, ScanNet, and KITTI, thus showing the applicability of our method to various datasets. The source code and dataset will be publicly released.

1. INTRODUCTION

Point cloud registration is a task that aims to recover the 3D rigid transformation between two possibly overlapping point cloud fragments. The rapid advance of commodity 3D sensors gives rise to the necessity of efficient point cloud registration algorithms for numerous real-world applications, including 3D reconstruction for virtual-, augmented-, and mixed-reality applications, and the navigation systems of autonomous vehicles or robotic agents. Recent work has made remarkable progress in developing learning-based point cloud registration algorithms for tackling real-world 3D scans (Geiger et al., 2012; Zeng et al., 2017) with high-resolution feature extraction (Choy et al., 2019b; Bai et al., 2020), in the presence of a low inlier correspondence ratio (Choy et al., 2020b; Bai et al., 2021; Lee et al., 2021), or with small overlap regions between point pairs (Huang et al., 2021). However, the imbalance in spatial extent and point density between the input point clouds is often overlooked, despite its practical relevance to problems such as incremental mapping or the registration of partial observations against the holistic environment. For instance, there are sensible solutions for registering a pair of 3D LiDAR scans, but registering a single LiDAR scan against a large-scale 3D map still remains challenging. A viable solution is to apply a global localization approach (Uy & Lee, 2018; Komorowski, 2021; Du et al., 2020; Zhang & Xiao, 2019; Liu et al., 2019), but the existing methods cast the problem as a retrieval task and assume that the 3D map is given as a set of overlapping 3D scans rather than a holistic map, which does not generally apply to unbalanced point pairs.
Recent feature-based pairwise point cloud registration methods are equipped with matchability detection (Bai et al., 2020), overlap detection (Huang et al., 2021), or hierarchical correspondence prediction (Yu et al., 2021), which are potentially advantageous for registering unbalanced point clouds. However, we empirically found that they collapse when registering unbalanced point clouds: point cloud description and matching in modern feature-based registration methods tend to be distracted by similar geometric structures that often appear in the larger point cloud. To this end, we propose UPPNet, the first neural architecture designed to be efficient for large-scale Unbalanced Point cloud Pair registration. UPPNet is a hierarchical framework that effectively finds inlier correspondences by gradually reducing the search space. At the coarsest level, a submap proposal module proposes the subregions that are likely to overlap with the query by utilizing global geometric context. Then, the coarse-to-fine matching module predicts accurate point-level correspondences by utilizing attention-based context aggregation and solving optimal transport problems. The subsequent structured matching module filters out outlier correspondences that violate spatial compatibility. To evaluate our method, we create a carefully designed benchmark, the KITTI-UPP dataset, for matching point cloud pairs under diverse spatial extent and point density imbalances, by augmenting the KITTI odometry dataset (Geiger et al., 2012). The experiments show that our method improves Registration Recall on the KITTI-UPP dataset by over 19.6% compared to state-of-the-art registration pipelines when the target point cloud is 11.1 times spatially larger and 11.7 times denser than the query point cloud.
Furthermore, we evaluate the proposed method in unbalanced indoor environments using the ScanNet (Dai et al., 2017) dataset and show that the proposed method generalizes to indoor RGB-D scans. Finally, to demonstrate the applicability of the proposed method to partially overlapping point cloud pairs, we evaluate our method on the standard pairwise registration benchmarks: 3DMatch, 3DLoMatch (Zeng et al., 2017; Huang et al., 2021), and KITTI odometry (Geiger et al., 2012). The proposed method achieves registration accuracy competitive with modern pairwise registration methods (Bai et al., 2020; Huang et al., 2021; Yu et al., 2021). An overview of our method can be found in Figure 1. Our main contributions are summarized as follows:
• We propose a novel hierarchical framework that gradually reduces the search space via a submap proposal module and coarse-to-fine matching modules, which can effectively handle unbalanced point cloud registration tasks.
• We introduce a new benchmark, the KITTI-UPP dataset, by carefully augmenting a large-scale outdoor LiDAR dataset (Geiger et al., 2012).
• Our method can be trained in an end-to-end manner and demonstrates strong generalization across a wide range of spatial extents and point densities.
One feasible solution to tackle unbalanced point cloud registration is to divide the task into two separate sub-tasks: 1) nearest frame retrieval and 2) local registration. After the target frame is retrieved, the relative pose between the query and the target frame is predicted by applying conventional pairwise registration methods. These methods, however, suffer from several drawbacks. First, representing a scene with a set of multiple overlapping local frames incurs significant overhead in both computational cost and memory footprint. Second, they are unlikely to generalize when the point cloud pairs have a domain difference.
For instance, when registering the reconstructed point cloud of a large-scale scene against a local scan, global localization methods cannot be directly adopted due to the severe density difference. Finally, the success of the registration strongly depends on the localization result: such methods cannot recover when the coarse alignment via localization fails.

3. METHOD

This section describes the proposed UPPNet, designed for registering unbalanced point cloud pairs. To handle spatial and density imbalance, we introduce a hierarchical framework consisting of three matching stages: submap matching (Section 3.3), super point matching (Section 3.4), and point matching (Section 3.5). An overview of the proposed pipeline is illustrated in Figure 1.

3.1. PROBLEM DEFINITION

Given a pair of possibly overlapping point clouds X ∈ R^{n×3}, Y ∈ R^{m×3}, our goal is to estimate the optimal rigid transformation R*, t* that minimizes the geometric error:

R*, t* = argmin_{R,t} Σ_{(i,j)∈C} ‖R y_i + t − x_j‖²,

where R ∈ SO(3) and t ∈ R³ are the rotation matrix and translation vector, and C is the set of true correspondences between X and Y. In this paper, we are interested in the case where V(X) ≫ V(Y) and n ≫ m, where V(·) denotes the spatial volume of the bounding box that tightly covers the input point cloud. In other words, there is spatial and density imbalance between X and Y. From here on, we refer to the reference point cloud that spans a larger spatial extent with higher density (X) as the map, and the small and sparse query point cloud (Y) as the query. Furthermore, we use the spatial imbalance factor ρ_s = V(X)/V(Y) and the point imbalance factor ρ_p = n/m to denote the relative imbalances between the two point clouds in terms of spatial extent and number of points. Note that we can also calculate the density imbalance factor as ρ_d = ρ_p/ρ_s.
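The imbalance factors defined above can be computed directly from the two point clouds. A minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def imbalance_factors(map_pts: np.ndarray, query_pts: np.ndarray):
    """Spatial (rho_s), point (rho_p), and density (rho_d) imbalance factors
    from Section 3.1. map_pts is X (n, 3); query_pts is Y (m, 3)."""
    def bbox_volume(pts):
        # V(.) : volume of the axis-aligned bounding box that tightly covers pts
        extent = pts.max(axis=0) - pts.min(axis=0)
        return float(np.prod(extent))

    rho_s = bbox_volume(map_pts) / bbox_volume(query_pts)  # V(X) / V(Y)
    rho_p = len(map_pts) / len(query_pts)                  # n / m
    rho_d = rho_p / rho_s                                  # density imbalance
    return rho_s, rho_p, rho_d
```

For a map whose bounding box is 8x the query's with 4x the points, this returns ρ_s = 8, ρ_p = 4, ρ_d = 0.5, i.e., the map here is larger but sparser per unit volume.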

3.2. FEATURE EXTRACTION

The proposed UPPNet begins with feature extraction to encode the geometric context of the map and query point clouds. To this end, we adopt a shared U-shaped network implemented with KPConv (Thomas et al., 2019) to extract multi-level features. Given an input point cloud X ∈ R^{n×3} with initial features F_in_X ∈ R^{n×d_in}, the feature extractor f_θ(·) outputs pointwise features F_X ∈ R^{n×d} that encode local geometric context, denoted as (X, F_X) = f_θ(X, F_in_X). At the coarsest resolution, i.e., at the end of the encoder, we obtain downsampled points with corresponding feature vectors; we call the downsampled points super points and denote their coordinates and features by X′ ∈ R^{n′×3} and F_X′. For each super point x′_i, the input point cloud X can be partitioned into groups by assigning each point to its closest super point as in Yu et al. (2021): G_i = {x ∈ X : ‖x − x′_i‖ ≤ ‖x − x′_j‖, ∀j ≠ i}, (2) where we denote the corresponding super point feature map by F′_{G_i}.
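The point-to-super-point grouping of Eq. 2 is a nearest-centroid assignment. A brute-force sketch (a KD-tree would be used in practice; names are ours):

```python
import numpy as np

def group_by_superpoint(points: np.ndarray, super_points: np.ndarray):
    """Partition `points` (n, 3) into groups G_i by assigning each point to
    its nearest super point (n', 3), as in Eq. 2. Returns a list of arrays,
    one group per super point."""
    # (n, n') pairwise Euclidean distances, then argmin over super points
    d = np.linalg.norm(points[:, None, :] - super_points[None, :, :], axis=-1)
    assign = d.argmin(axis=1)
    return [points[assign == i] for i in range(len(super_points))]
```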

3.3. SUBMAP MATCHING

Building submaps. The first stage of our hierarchical registration pipeline is the submap proposal, which builds the submap candidates that are likely to overlap with the query point cloud. To do so, we divide the map into L overlapping submaps. A submap S_i is the subset of super points of the map, i.e., S_i ⊂ X′, that lie in the i-th submap region S_i: S_i = {x′ ∈ X′ : x′ ∈ S_i}. Each (cubic) submap region S_i with center point c_i ∈ R³ and edge length v ∈ R⁺ is formally defined as S_i = {s ∈ R³ : max |s − c_i| ≤ v/2}. In this work, we evenly place L overlapping submap regions {S_i}_{i=1}^L to cover all the super points of the map X′ with a carefully chosen overlap ratio µ, satisfying |S_i ∩ S_j| / |S_i| = µ and ‖c_i − c_j‖ = v · µ for any adjacent, overlapping submap regions S_i and S_j. We set µ = 0.5 in our experiments. The edge length v of the submap regions is determined by the spatial size of the query Y: we first compute the furthest distance between two points in Y along the i-th axis as s_i = |max_a y_{a,i} − min_b y_{b,i}| ∈ R⁺, so that {s_i}_{i=1}^3 represent the width (s_1), height (s_2), and length (s_3) of Y, and set v = max_i s_i. The overview of the submap proposal and grouping is shown in Figure 1. Following Yu et al. (2021), we construct a bipartite graph between the query and map point clouds, regarding the super points as nodes, and apply a sequence of self-, cross-, and self-attention layers to fuse the global context between the two point clouds and augment each super point feature. Given super point feature maps (F′_X, F′_Y), we linearly project the source feature F′_X to the query, Q = W_Q F′_X, and the target feature F′_Y to the key and value, K = W_K F′_Y and V = W_V F′_Y, where W_Q, W_K, and W_V are learnable projection matrices.
Global feature aggregation. After augmenting the super point features, we aggregate them into a global descriptor for each submap. The super point features are lifted into a higher dimension by a linear layer and then aggregated into a global descriptor via Generalized Mean Pooling (GeM) (Radenović et al., 2018). GeM has shown its strength in localization tasks across modalities (Tolias et al., 2015; Komorowski, 2021). Concisely, GeM aggregates the super point features F′_{S_i} of submap S_i into a global descriptor F^GD_{S_i} as F^GD_{S_i}(k) = (1/|S_i| Σ_{x′_j ∈ S_i} (F′_{x′_j}(k))^α)^{1/α}, where F^GD_{S_i}(k) is the k-th element of the global descriptor and α is a learnable parameter. Global feature matching. With the aggregated global descriptors {F^GD_{S_i}} for the submaps and F^GD_Y for the query, we calculate the similarity using the L2 distance between global descriptors: d^GD_i = ‖F^GD_{S_i} − F^GD_Y‖₂, d^GD = [d^GD_1, …, d^GD_L]ᵀ ∈ R^{L×1}, where d^GD is a one-dimensional vector because the query point cloud Y is described with a single global descriptor F^GD_Y, whereas the map X is described with L global descriptors, one per submap S_i. At training time, we apply a loss function on d^GD with ground-truth supervision. At inference time, we pick the k submaps with the top-k similarity values, where k is a hyperparameter.
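The GeM aggregation and the L2-based submap selection above can be sketched as follows (α is learnable in the paper but fixed here; we also assume non-negative features, e.g., post-ReLU, and clamp for numerical safety):

```python
import numpy as np

def gem_pool(features: np.ndarray, alpha: float = 3.0) -> np.ndarray:
    """Generalized Mean Pooling over super point features of shape (|S_i|, d).
    F_GD(k) = (mean_j F'(j, k)^alpha)^(1/alpha); alpha=1 is average pooling,
    alpha -> inf approaches max pooling."""
    f = np.clip(features, 1e-6, None)  # assumes non-negative (post-ReLU) features
    return np.power(np.mean(np.power(f, alpha), axis=0), 1.0 / alpha)

def topk_submaps(submap_descs, query_desc: np.ndarray, k: int):
    """Pick the k submaps whose global descriptors are closest (L2) to the
    query's global descriptor."""
    d = np.linalg.norm(np.stack(submap_descs) - query_desc, axis=1)
    return np.argsort(d)[:k]
```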

3.4. SUPER POINT MATCHING

After the candidate submaps are proposed with consideration of the global geometric context, we perform super point matching. For each proposed submap S_i, the super points belonging to the submap are retrieved with their corresponding features F′_{S_i} ∈ R^{|S_i|×d′}. We then leverage the super point features F′_{S_i}, F′_Y of the submap and query to calculate the similarity matrix S_{S_i} via inner product: S_{S_i} = F′_{S_i} F′_Y^T, S_{S_i} ∈ R^{|S_i|×n′}. (4) We augment S_{S_i} with a row and column of learnable slack entries z to obtain Z_{S_i} ∈ R^{(|S_i|+1)×(n′+1)}, and utilize the Sinkhorn algorithm (Sinkhorn & Knopp, 1967) to solve an optimal transport problem on Z_{S_i}, obtaining Z̄_{S_i}. Each entry (i, j) of Z̄_{S_i} indicates the normalized probability that (i, j) is a true correspondence. By thresholding Z̄_{S_i} with a predefined value τ_Z, we obtain the set of predicted super point correspondences C′, which is passed to the final point matching module to produce point correspondences C. Structured matching. We use spatial compatibility (Bai et al., 2021; Lee et al., 2021) to further reject outlier correspondences: two correspondences (p_i, q_i) and (p_j, q_j) are likely to be inliers if their spatial distance difference |d(p_i, p_j) − d(q_i, q_j)| is smaller than a predefined threshold. Extending this concept, for each correspondence we compute a compatibility score as the number of compatible correspondences in the whole correspondence set. For example, if the correspondence set is {c1, c2, c3}, the compatibility score of c1 is 2 when both c2 and c3 are spatially compatible with c1. We consider correspondences with a low score as outliers and filter them out. In our hierarchical framework, we apply this strategy to both the super point matching and point matching modules to retain only reliable correspondence pairs. Furthermore, KNN search is used to keep as many inliers as possible in the initial correspondence set.
For more details about the structured matching, please refer to Appendix.
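The optimal transport step on the slack-augmented score matrix can be sketched with a log-domain Sinkhorn iteration. This is a simplified stand-in for the paper's module: the slack entries are fixed to zero rather than learned, and the iteration count is a hyperparameter of our own choosing.

```python
import numpy as np

def sinkhorn_with_slack(scores: np.ndarray, n_iters: int = 50) -> np.ndarray:
    """Approximate assignment on a similarity matrix augmented with a slack
    row and column (which absorb unmatched points), by alternating row and
    column normalization in log space for numerical stability."""
    s = np.pad(scores, ((0, 1), (0, 1)), constant_values=0.0)  # slack z = 0
    log_z = s.copy()
    for _ in range(n_iters):
        log_z -= np.logaddexp.reduce(log_z, axis=1, keepdims=True)  # rows
        log_z -= np.logaddexp.reduce(log_z, axis=0, keepdims=True)  # columns
    return np.exp(log_z)[:-1, :-1]  # drop slack entries before thresholding
```

On a strongly diagonal score matrix, the returned matrix concentrates probability mass on the diagonal, from which correspondences are then extracted by thresholding with τ_Z.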

3.5. POINT MATCHING

Given the predicted super point correspondences, we refine them to point-level correspondences for the final rigid transformation estimation. We expand each super point correspondence into a pair of point patches consisting of the neighboring points around the super points in each point cloud. We use the point-to-super-point grouping described in Eq. 2 and pass the selected point groups to the matching module of Section 3.4. After applying the threshold to the similarity matrix and structured matching, we obtain the final set of point-level correspondences between the two input point clouds. We then run RANSAC on the correspondence set to estimate the rigid transformation parameters. Note that there are multiple sets of correspondences, since we run the matching module for each proposed submap in parallel. We select the transformation with the highest inlier ratio among the candidates as our final prediction.
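The final selection among per-submap transformation candidates can be sketched as follows (the threshold τ and all names are illustrative; the paper's RANSAC step produces the candidate transforms):

```python
import numpy as np

def best_transform(transforms, src: np.ndarray, dst: np.ndarray, tau: float = 0.5):
    """Among candidate rigid transforms (one per proposed submap), keep the one
    with the highest inlier ratio over the matched points.
    transforms: list of (R, t); src, dst: (N, 3) corresponding points."""
    def inlier_ratio(R, t):
        residual = np.linalg.norm(src @ R.T + t - dst, axis=1)
        return float((residual < tau).mean())

    ratios = [inlier_ratio(R, t) for R, t in transforms]
    return transforms[int(np.argmax(ratios))], max(ratios)
```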

3.6. LOSS

We train our network using the loss function L = L_s + λ_g L_g + λ_p L_p, i.e., the weighted sum of the submap matching loss L_s, the super point matching loss L_g, and the point matching loss L_p. Specifically, we define the point matching loss L_p as:

L_p = −(Σ_{i,j} Ẑ(i, j) log Z̄(i, j)) / (Σ_{i,j} Ẑ(i, j)),

where Z̄ is the predicted similarity matrix after solving optimal transport with the Sinkhorn algorithm, and Ẑ is the binary ground-truth matrix: Ẑ(i, j) = 1 if (i, j) is a true correspondence, and Ẑ(i, j) = 0 otherwise. For L_g and L_s, we apply the same formulation, but use the overlap ratio between the two patches around the super points as the ground-truth soft label values. Consequently, only a point-wise binary matrix is needed for supervision, and the ground-truth matrices for super point matching and submap matching are derived from Ẑ; no supervision other than the rigid transformation between the query and map point clouds is required. More details on calculating the ground-truth similarity matrices are provided in the Appendix.
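The point matching loss is a negative log-likelihood over the ground-truth correspondences, normalized by their count. A minimal numpy sketch (the real loss operates on the Sinkhorn output inside the training graph):

```python
import numpy as np

def matching_loss(z_pred: np.ndarray, z_gt: np.ndarray, eps: float = 1e-9) -> float:
    """L_p = -sum_{i,j} Z_gt(i,j) * log Z_pred(i,j) / sum_{i,j} Z_gt(i,j).
    z_pred: predicted assignment probabilities; z_gt: binary ground truth."""
    return float(-(z_gt * np.log(z_pred + eps)).sum() / (z_gt.sum() + eps))
```

A confident, correct prediction yields a loss near zero, while a uniform prediction over N candidates yields roughly log N per true correspondence.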

4. EXPERIMENT

4.1 EXPERIMENTAL SETTINGS

KITTI-UPP dataset. To the best of our knowledge, there are no available benchmarks or public datasets targeting the large-scale unbalanced point cloud registration task. To validate the effectiveness of our approach, we introduce KITTI-UPP, a carefully designed dataset for large-scale unbalanced point cloud registration. We build KITTI-UPP by aggregating sequential LiDAR frames for each scene provided by the KITTI odometry benchmark (Geiger et al., 2012). To control the spatial and density imbalance factors between the map and the query as defined in Section 3.1, we tune two parameters when selecting KITTI frames for aggregation: range and hop. The range determines how many frames we use to construct a map, indicating the map size; if the range is set to 500, we aggregate 500 consecutive LiDAR frames to build a single map. The hop indicates the frame jump for the LiDAR frame aggregation; if we set the hop to 10, every 10th frame is used for the aggregation. In this manner, the hop controls the density of the aggregated point cloud. In our experiments, we set 300 and 10 as the default values for range and hop, respectively, to train UPPNet and the baseline approaches. We then utilize KITTI-UPP scenes made with other hops and ranges that are unseen during training to analyze the registration performance under various spatial and density imbalance factors. Note that we ensure that no query is included in its map: when we aggregate every 10th frame to construct a map, i.e., frames {10·i | 0 ≤ i ≤ ⌈range/10⌉}, we select the query frames from the index set {10·i + 5 | 0 ≤ i ≤ ⌈range/10⌉} so that the same frame is never used in both a query and a map. ScanNet benchmark. To validate the robustness of our proposed method in large-scale indoor environments, we use the ScanNet (Dai et al., 2017) benchmark, which contains over 2M RGB-D scans of 707 unique indoor scenes.
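The map/query index construction above, generalized from the hop = 10 case stated in the text, can be sketched as (the generalization to arbitrary hop via `hop // 2` is our assumption):

```python
def map_and_query_indices(range_: int, hop: int):
    """Frame indices for map aggregation and query sampling in KITTI-UPP.
    Map: every hop-th frame {hop*i | 0 <= i <= ceil(range/hop)}; queries are
    offset by hop // 2 so that no frame appears in both sets."""
    n = -(-range_ // hop) + 1  # ceil(range / hop) + 1 index values
    map_idx = [hop * i for i in range(n)]
    query_idx = [hop * i + hop // 2 for i in range(n)]
    return map_idx, query_idx
```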
To evaluate the methods in an unbalanced environment, we use the provided reconstructed mesh of ScanNet as a map and register a single RGB-D frame against it, which lets us omit the labor-heavy map generation process used in the KITTI-UPP dataset construction. An example pair is illustrated in Figure 2. Specifically, we use a subset of ScanNet consisting of 12/2/5 scenes for the training/validation/testing splits, respectively, where each scene contains a varying number of scans ranging from 8 to 113. Evaluation metric. We use the standard metrics to assess pairwise registration accuracy: Rotational Error (RE), arccos((Tr(R̄ᵀR) − 1)/2), and Translational Error (TE), ‖t̄ − t‖₂, where R, t are the predicted rotation matrix and translation vector, and R̄, t̄ are the ground truth. We also use the Inlier Ratio (IR) and Registration Recall (RR) for evaluation. IR is defined as the ratio of correspondences whose geometric distances are below a predefined threshold (τ_I) when transformed with the ground-truth transformation. We regard a registered pair with RE and TE less than the predefined thresholds (τ_R, τ_t) as a successful registration and calculate RR as the fraction of successful registrations over the entire dataset. For indoor datasets, we report three standard metrics, detailed in the Appendix. On the KITTI-UPP benchmark, our method achieves the best results in all metrics. Moreover, our method achieves the best generalization ability with respect to various spatial extent and density imbalances, as shown in Figure 3, indicating the advantages of our method. To further verify the effectiveness of UPPNet, we conduct an additional experiment that incorporates a retrieval method into the baseline models: we combine the pairwise registration methods with MinkLoc3Dv2 (Komorowski, 2022), which shows the best performance on the 3D retrieval task.
For a fair comparison, we train MinkLoc3Dv2 (Komorowski, 2022) on our KITTI-UPP dataset for 200 epochs. Even though the pairwise registration methods benefit from MinkLoc3Dv2, UPPNet still outperforms the baseline models by large margins. We conclude that the global descriptor generation in MinkLoc3Dv2 is not suitable for our challenging scenario with extreme imbalance in spatial extent and point density, as it aggregates the features of all points rather than only reliable points, in contrast to UPPNet. Indoor experiment. We report unbalanced point cloud registration results in large-scale indoor environments using ScanNet (Dai et al., 2017). Our method achieves a 21.1% Registration Recall improvement over CoFiNet, which reveals the robustness of our method for indoor unbalanced point cloud registration as well.

4.3. ANALYSIS

To study the effectiveness of the proposed UPPNet, we conduct extensive ablation experiments and report the results in Table 2, covering the core design choices of our method: the k-nearest-neighbor value and the usage of the submap proposal and structured matching modules. As shown in Table 2, k = 16 is the optimal configuration considering all metrics. In addition, we find that both the submap proposal and structured matching modules bring significant improvements in registration recall, with the best performance achieved when all modules are enabled. As shown in Table 1 and Figure 3, our method exhibits results competitive with the previous state-of-the-art registration methods on both indoor and outdoor datasets, even in the extremely low overlap scenario. This result suggests that our method is not specialized for the KITTI-UPP benchmark but also applies to indoor and low-overlap environments. For more details on training and evaluation for the 3DMatch dataset, please refer to the Appendix. Finally, we measure the latency of our method and report the breakdown of elapsed time for each module on the KITTI-UPP benchmark in comparison with CoFiNet (Yu et al., 2021). As reported in Table 3, our method takes 0.4 seconds more than CoFiNet but improves registration recall by 15.6%.

5. CONCLUSION

In this paper, we presented a neural architecture for unbalanced point cloud registration under extreme spatial scale and point density discrepancy. To tackle this problem, we proposed a hierarchical framework that finds inlier correspondences effectively by gradually reducing the search space. Our method handles scale differences by finding subregions that are likely to overlap with the query point cloud and estimating correspondences on the selected subregions via a coarse-to-fine matching module. Finally, structured matching is applied to further prune noisy correspondences. Our method outperforms the state-of-the-art methods by a large margin in extensive experiments on the challenging dataset. Limitations. We use RANSAC to obtain the final rigid transformation parameters from the estimated correspondences. Leveraging a differentiable model estimator, e.g., the weighted Procrustes method (Choy et al., 2020a), would be an interesting direction for future research.

A APPENDIX

In this supplementary material, we provide the detailed algorithm of the structured matching procedure in Section A.1, the description of how we form the data splits of our KITTI-UPP dataset for training and evaluation in Section A.2, additional details on the experiments in Section A.3, the additional quantitative results in Section A.4 with the implementation details of baseline methods, equations for the loss terms in Section A.5, details of architectural configuration in Section A.6, and finally qualitative results in Section A.7.

A.1 STRUCTURED MATCHING

We provide additional information on our structured matching. As in Section 3.4, the initial correspondence set is extracted from the similarity matrix Z̄. A simple strategy is to consider an entry (i, j) with a high confidence score in Z̄ as a valid correspondence, as in Yu et al. (2021): C = {(i, j) | Z̄(i, j) > τ_Z}. (7) However, this strategy is prone to missing inlier correspondences in our unbalanced setting: it is difficult to extract consistent features from the same region of the map and query, so inlier correspondences often have low confidence scores. An alternative is to select confident correspondences relatively for each point rather than using an absolute threshold. We utilize k-nearest-neighbor (KNN) search in feature space and retrieve, for each point, the top-k candidate correspondences with the highest inlier confidence values: C = {(i, j) | Z̄(i, j) ≥ Topk(Z̄(i, :))}. This approach guarantees that k candidates are selected for each point even if their confidence scores are low. Although more inlier correspondences can be obtained by KNN search, selecting extra correspondences also yields a high number of outliers. To cope with this, we combine spatial compatibility with KNN search for outlier rejection. We first check whether each correspondence satisfies spatial compatibility with the others, as illustrated in Figure 4(b). Assume that we have N correspondences {(x_n, y_n)}_{n=1}^N, where x_n, y_n ∈ R³ form the n-th correspondence. The spatial compatibility matrix S ∈ R^{N×N} encodes relative distances of the correspondences such that S_{i,j} = 1[|d(x_i, x_j) − d(y_i, y_j)| < θ], where θ is a distance threshold; a score of 1 is assigned to S_{i,j} if the pair of correspondences i, j is spatially consistent, and a score of 0 otherwise. In this toy example, {c1, c3, c5, c7} are inlier correspondences and {c2, c4, c6, c8} are outlier correspondences.
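The per-point top-k extraction above can be sketched as follows (names are ours; a vectorized `argpartition` would be used at scale):

```python
import numpy as np

def topk_correspondences(z: np.ndarray, k: int):
    """For each query point (row of the assignment matrix Z), keep the k map
    points with the highest confidence, instead of a global threshold tau_Z.
    Returns a set of (row, col) index pairs."""
    corr = set()
    for i in range(z.shape[0]):
        for j in np.argsort(z[i])[-k:]:  # top-k columns of row i
            corr.add((i, int(j)))
    return corr
```

Unlike absolute thresholding, this always yields k candidates per point, which preserves low-confidence inliers at the cost of extra outliers, which the structured matching below then rejects.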
Although we can reject c6 by keeping only correspondences with at least one compatible correspondence, we still suffer from the outliers {c2, c4, c8}. Notably, a large-scale map contains numerous repetitive structures, making it challenging to distinguish hard negative correspondences such as {c2, c4, c8} using only their associated features, i.e., through first-order matching. Instead, we can select correspondences with high compatibility scores, as described in Section 3.4, by counting the number of compatible correspondences for each correspondence. Furthermore, we can calculate the compatibility coherence between two correspondences by counting how many correspondences are compatible with both of them. The compatibility coherence scores of a correspondence i are formulated as the similarities (dot products) between the spatial consistency of correspondence i and those of the other correspondences: C_i = (S Sᵀ)_{i,:} ∈ R^N, which counts the number of spatially consistent matches that correspondence i has in common with each other correspondence. In other words, the compatibility coherence score of correspondences i and j amounts to the number of spatially consistent correspondences they have in common. As a specific example, correspondences c1 and c3 in Figure 4(a) have two common spatially consistent correspondences, c5 and c7, which makes C_{1,3} = C_{3,1} = 2, while correspondences c6 and c8 have no common spatially consistent correspondences, so C_{6,8} = C_{8,6} = 0. In our experiments, we use the compatibility coherence because it shows better performance, as reported in Table 4. We obtain the compatibility coherence by multiplying the compatibility matrix in Figure 4(b) with its transpose. Then, we compute the average coherence value per correspondence. Finally, thresholding these values with t indicates which correspondences to keep for the final correspondence set {c1, c3, c5, c7}. In our case, we set this threshold to the mean of the average coherence values, as shown in Figure 4(c).
As a result, by combining this strategy with KNN search, we first obtain as many inliers as possible and then efficiently filter out the outliers through structured matching. Experimentally, combining these two approaches is the key to resolving the noisy correspondence problem caused by the feature ambiguity common in unbalanced point pairs. [Figure 4(b)-(c): the 8×8 binary spatial compatibility matrix S over correspondences c1-c8 and the resulting compatibility coherence matrix.]
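The full filtering step, spatial compatibility matrix, coherence via S Sᵀ, and mean-based thresholding, can be sketched end to end (names and the diagonal handling are ours):

```python
import numpy as np

def coherence_filter(src: np.ndarray, dst: np.ndarray, theta: float = 0.2):
    """Reject outlier correspondences via compatibility coherence.
    src, dst: (N, 3) endpoints of N correspondences. Returns indices kept."""
    # S_{i,j} = 1[ |d(x_i, x_j) - d(y_i, y_j)| < theta ]
    d_src = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None, :], axis=-1)
    S = (np.abs(d_src - d_dst) < theta).astype(float)
    np.fill_diagonal(S, 0.0)  # a correspondence is not compared with itself
    C = S @ S.T               # coherence: common spatially consistent matches
    avg = C.mean(axis=1)      # average coherence per correspondence
    return np.flatnonzero(avg >= avg.mean())  # keep above-mean coherence
```

On a toy set of four rigidly consistent correspondences plus one outlier, the outlier's coherence row is all zeros and it falls below the mean threshold.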

A.2 DATASET

We follow the same data split strategy that FCGF (Choy et al., 2019b) uses for KITTI odometry dataset (Geiger et al., 2012) for our KITTI-UPP dataset. We use sequences 0-5 for training, 6-7 for validation, and 8-10 for testing. For building the input point cloud pair, the query point clouds are sampled from the original frames of the KITTI odometry dataset, and the map point clouds are created by aggregating the LiDAR frames while managing two parameters: range and hop as described in Section 4.1. During training, we set the range and hop value to 300 and 10 to build maps. The query frames are at least 10m apart from each other, and we pick the map which is the closest one to each query among the generated map point clouds. For evaluation, we use varying range and hop values to evaluate the generalization ability of the methods. Especially for the test split, we carefully designed the input point cloud pairs so that the query would not be included in the frames that are used to generate the map. Through this procedure, we yield 1,358 pairs for training, 180 for validation, and 275 for testing.

A.3 ADDITIONAL DETAILS ON THE EXPERIMENTS

Evaluation metric. We use the standard metrics to assess pairwise registration accuracy: Rotational Error (RE), arccos((Tr(R̄ᵀR) − 1)/2), and Translational Error (TE), ‖t̄ − t‖₂, where R, t are the predicted rotation matrix and translation vector, and R̄, t̄ are the ground truth. We also use the Inlier Ratio (IR) and Registration Recall (RR) for evaluation. IR is defined as the ratio of correspondences whose geometric distances are below a predefined threshold (τ_I) when transformed with the ground-truth transformation. We regard a registered pair with RE and TE less than the predefined thresholds (τ_R, τ_t) as a successful registration and calculate RR as the fraction of successful registrations over the entire dataset. Implementation details. We implement our method with PyTorch (Paszke et al., 2019) and KPConv (Thomas et al., 2019) for efficient 3D kernel-point convolution. We use a U-shaped network with an encoder and decoder, consisting of three KPConv layers and three transposed KPConv layers. The feature dimensions of the point feature and super point feature are 32 and 256, respectively. Input point clouds are downsampled with a 1m voxel size on KITTI-UPP, 30cm on KITTI (Geiger et al., 2012), and 2.5cm on 3DMatch (Zeng et al., 2017) and 3DLoMatch (Huang et al., 2021). We set λ_g = λ_p = 1 and train the network for 200 epochs with the Adam optimizer and an initial learning rate of 1e-4. At inference time, k is set to 16 for the KNN search. The detailed architectural configuration can be found in Figure 7. All experiments are performed on a single Nvidia Tesla V100 GPU and an Intel Xeon Gold 6230R CPU @ 2.10GHz. Pairwise registration on 3DMatch and 3DLoMatch. To evaluate our method on the 3DMatch and 3DLoMatch datasets, we use the checkpoint of CoFiNet (Yu et al., 2021) pretrained on the 3DMatch dataset, since the network architecture of our approach for super point and point description is compatible with CoFiNet.
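The RE and TE metrics defined above can be computed directly from the predicted and ground-truth poses (a sketch; clamping the cosine guards against floating-point drift outside [−1, 1]):

```python
import numpy as np

def rotation_translation_errors(R_pred, t_pred, R_gt, t_gt):
    """RE = arccos((Tr(R_gt^T R_pred) - 1) / 2), reported in degrees;
    TE = ||t_pred - t_gt||_2."""
    cos = np.clip((np.trace(R_gt.T @ R_pred) - 1.0) / 2.0, -1.0, 1.0)
    re = float(np.degrees(np.arccos(cos)))
    te = float(np.linalg.norm(t_pred - t_gt))
    return re, te
```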
• Registration Recall (RR): the ratio of successful registrations over the entire dataset. We consider a registration with RMSE < 0.2m as successful for the 3DMatch and 3DLoMatch datasets.
• Inlier Ratio (IR): the fraction of correct correspondences among the estimated correspondence set. We consider correspondences matched with a residual smaller than 10cm as correct.
• Feature Matching Recall (FMR): the fraction of point pairs with an inlier ratio higher than a predefined value over the entire dataset. We use 5% as the threshold for the 3DMatch and 3DLoMatch datasets.

A.4 ADDITIONAL QUANTITATIVE RESULTS

More baselines. We provide an additional comparison of our UPPNet with state-of-the-art pairwise registration methods (Bai et al., 2020; Huang et al., 2021; Yu et al., 2021; Lu et al., 2021) and classical methods (Zhou et al., 2016; Fischler & Bolles, 1981) in Table 5 and Table 6, with varying spatial and point imbalance factors. For the classical methods, we evaluate RANSAC (Fischler & Bolles, 1981) and FGR (Zhou et al., 2016) equipped with the classical descriptor FPFH (Rusu et al., 2009) on our KITTI-UPP dataset. For RANSAC, we report the results with 4M iterations. Both mostly fail on our challenging dataset. We further refine the registration results of both methods with an ICP (Besl & McKay, 1992) post-processing step. For the learning-based methods, we select Predator (Huang et al., 2021), CoFiNet (Yu et al., 2021), D3Feat (Bai et al., 2020), and HRegNet (Lu et al., 2021) as baselines, as they are likely to be favorable for unbalanced point cloud registration. For a fair comparison, we train Predator, CoFiNet, and ours for 200 epochs on our KITTI-UPP dataset. For D3Feat and HRegNet, the models pretrained on the KITTI odometry dataset for 200 epochs are finetuned on our KITTI-UPP dataset for an additional 50 epochs.

More results with varying imbalance factors. To evaluate generalizability, we compare our UPPNet with the baseline methods under various scales and densities in Table 5 and Table 6. In Table 5a, we report the results with a fixed range value of 100 under hop values of 25 (ρ_p = 2.3), 10 (ρ_p = 3.7), 5 (ρ_p = 4.8), and 1 (ρ_p = 6.9). Note that a range value of 100 with a hop value of 25 is the least challenging setting, most similar to a balanced dataset, e.g., the KITTI odometry dataset. Likewise, the results in Tables 5b, 6a, and 6b are reported under the same hop values of 25, 10, 5, and 1 with fixed range values of 200, 300, and 400, respectively.
As shown in Tables 5a-6b, the performance of D3Feat (Bai et al., 2020) and CoFiNet (Yu et al., 2021) drops dramatically as the spatial extent and density of the map grow. In contrast, our UPPNet maintains high performance on all evaluation metrics and outperforms the baseline methods by a large margin, thanks to the robust structured matching and our hierarchical framework.

Registration results with different overlap ratios. To verify the robustness of our method with respect to the overlap ratio between query and map, we conduct experiments under varying overlap ratios in addition to spatial and density imbalance. In this experiment, we set the range and hop values to 100 and 25, respectively. Following FCGF (Choy et al., 2019b), each pair is at least 10m apart. We compute the overlap ratio for each pair and assign each pair to one of three groups, pairs with < 70%, 70-85%, and > 85% overlap, denoted as low, mid, and high overlap, respectively. The results are shown in Figure 5, along with the results of two baseline methods, CoFiNet (Yu et al., 2021) and Predator (Huang et al., 2021). As shown in the figure, our method consistently outperforms the baselines by a large margin for all three overlap groups, gaining a notable registration recall improvement of 26% over CoFiNet. We note that our method is not only effective in the presence of spatial and density imbalance but is also robust under different overlap ratios.

Registration results with ground-truth submaps. Providing the ground-truth submap is advantageous for the baseline methods since they are optimized for balanced point pairs, and we refer to these results as the theoretical upper bounds of the baseline methods, since this assumes the combination of a highly accurate localization algorithm with each baseline. As shown in Figure 6, thanks to the robustness of the submap matching module in establishing reliable submap correspondences, our model without ground-truth submaps suffers relatively small performance drops compared to the others (Yu et al., 2021; Huang et al., 2021).
Moreover, we highlight that the proposed structured matching effectively helps our method achieve the best results, outperforming the upper bounds of the other baselines by large margins even without ground-truth submaps.
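The overlap ratio used to form the low/mid/high groups above can be computed with a simple nearest-neighbor test under the ground-truth transform. A minimal sketch (the 0.5m threshold is an illustrative assumption, not necessarily the paper's value; a k-d tree would replace the brute-force distances at scale):

```python
import numpy as np

def overlap_ratio(query, target, R_gt, t_gt, tau=0.5):
    """Fraction of query points that have a target point within tau (meters)
    after applying the ground-truth transform. Brute-force pairwise distances
    for clarity; a k-d tree is preferable for large clouds."""
    aligned = query @ R_gt.T + t_gt                                 # (n, 3)
    d2 = ((aligned[:, None, :] - target[None, :, :]) ** 2).sum(-1)  # (n, m)
    return float((np.sqrt(d2.min(axis=1)) < tau).mean())
```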

A.5 LOSS

In this section, we provide concise equations for the loss terms L_s, L_g, and L_p. As described in Section 3.6 of the main manuscript, the point matching loss L_p is defined as:

L_p = - ( Σ_{i,j} Ẑ(i, j) log Z(i, j) ) / ( Σ_{i,j} Ẑ(i, j) ),

where Z is the predicted similarity matrix after solving the optimal transport problem with the Sinkhorn algorithm, and Ẑ is the binary matrix indicating the ground-truth correspondences. For the coarse-scale super-point and submap-level matching, we follow Yu et al. (2021) and compute the overlap ratio between two super points, or two submaps, to build soft-labeled supervision. For super-point matching, we calculate the overlap ratio between two super points as:

r(i', j') = |{x ∈ G^X_{i'} : ∃ y ∈ G^Y_{j'} s.t. ‖R̄y + t̄ - x‖ < τ_g}| / |G^X_{i'}|,

where R̄ and t̄ are the ground-truth rotation and translation and τ_g is a predefined distance threshold. If the super-point pair (i', j') is not an inlier correspondence, we assign it to the slack entry of the similarity matrix. For this, we calculate the ratio of points in G^X_{i'} that overlap with the point cloud Y:

r(i') = |{x ∈ G^X_{i'} : ∃ y ∈ Y s.t. ‖R̄y + t̄ - x‖ < τ_g}| / |G^X_{i'}|,   (11)

and finally build the ground-truth similarity matrix Ẑ_g ∈ R^{(n'+1) × (m'+1)} as:

Ẑ_g(i', j') = min(r(i', j'), r(j', i'))   if i' < n'+1 and j' < m'+1,
              1 - r(i')                   if i' < n'+1 and j' = m'+1,
              1 - r(j')                   if i' = n'+1 and j' < m'+1,
              0                           otherwise.   (12)

We then calculate the loss analogously to the point matching loss:

L_g = - ( Σ_{i',j'} Ẑ_g(i', j') log Z_g(i', j') ) / ( Σ_{i',j'} Ẑ_g(i', j') ).

Note that unlike the point matching loss L_p, the super-point matching loss is supervised with the continuous soft labels Ẑ_g. Similarly, the submap matching loss L_s is calculated as:

L_s = - ( Σ_{i',j'} Ẑ_s(i', j') log Z_s(i', j') ) / ( Σ_{i',j'} Ẑ_s(i', j') ),

where Ẑ_s is the ground-truth similarity matrix at the submap level, computed in the same way.
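All three losses share the same normalized soft-label cross-entropy form; a minimal PyTorch sketch (assuming the predicted matrix already holds Sinkhorn-normalized assignment probabilities; the function name is ours):

```python
import torch

def matching_loss(Z_pred, Z_gt, eps=1e-12):
    """Soft-label matching loss: L = -sum_ij Zgt * log(Zpred) / sum_ij Zgt.
    Z_gt is binary for the point loss L_p and soft-valued for L_g and L_s."""
    return -(Z_gt * torch.log(Z_pred + eps)).sum() / (Z_gt.sum() + eps)
```

A perfect one-hot prediction drives the loss to zero, while a uniform prediction is penalized by the log of the number of candidates.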

A.6 NETWORK ARCHITECTURE

UPPNet adopts a shared U-shaped network based on KPConv (Thomas et al., 2019) to extract multi-level features. The detailed architecture is shown in Figure 7. Compared to CoFiNet (Yu et al., 2021), additional layers for the submap matching module are added to our network. We generate a global descriptor through Generalized Mean (GeM) pooling. As the global descriptor of the query is a single vector, the self-attention module is applied only to the global descriptors of the map, as shown in Figure 7.
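A minimal sketch of GeM pooling over a set of super-point features (p is typically a learnable parameter in practice; the clamp assumes non-negative activations):

```python
import torch

def gem_pool(F, p=3.0, eps=1e-6):
    """Generalized Mean (GeM) pooling: F (n, d) -> global descriptor (d,).
    p = 1 recovers average pooling; large p approaches max pooling."""
    return F.clamp(min=eps).pow(p).mean(dim=0).pow(1.0 / p)
```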



Figure 1: (Left) Overview of UPPNet. Given an unbalanced point cloud pair, UPPNet hierarchically reduces the search space by utilizing multi-level features. (Top right) Super-point features are strengthened via the attention-based context aggregation module, shown at (a), and aggregated into a global descriptor using generalized mean pooling, shown at (c). We select the submaps with top-k similarity. (Middle right) We build a similarity matrix using the super-point features in each selected submap, solve the optimal transport problem on the similarity matrix, and estimate the super-point correspondences, shown at (e). (Bottom right) Super-point correspondences are refined to point correspondences by utilizing node grouping, shown at (d), and the matching module. Structured matching, shown at (f), filters out the noisy correspondences that do not satisfy spatial compatibility. All modules in UPPNet, including feature extraction and correspondence estimation, can be trained in an end-to-end manner with the three losses L_s, L_g, and L_p.

(b) Attention-based context aggregation. The super-point feature maps of the query and the submaps are then strengthened via an attention-based context aggregation module (Yu et al., 2021; Huang et al., 2021; Sarlin et al., 2020) to incorporate the global geometric context. As in Huang et al. (2021), the input features are linearly projected into queries Q, keys K, and values V, where W_Q, W_K, and W_V are the learnable projection parameters. To calculate the message M that flows in the graph, we use attention-based aggregation: M = (QK^T / √b) · V, where b is the channel dimension of the super-point features. When calculating self-attention messages, we take (F'_X, F'_X) and (F'_Y, F'_Y) as inputs and flow the messages to the corresponding subgraphs.
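A minimal sketch of this message computation (W_q, W_k, W_v are illustrative placeholders for the learned projections; the softmax over the keys is our assumption, following common implementations such as Huang et al. (2021)):

```python
import torch

def attention_message(F_x, F_y, W_q, W_k, W_v):
    """Attention-based message passing: features F_x receive messages from F_y.
    For self-attention, pass the same feature map as both F_x and F_y."""
    Q = F_x @ W_q                       # (n, b) queries
    K = F_y @ W_k                       # (m, b) keys
    V = F_y @ W_v                       # (m, b) values
    b = Q.shape[-1]
    attn = torch.softmax(Q @ K.T / b ** 0.5, dim=-1)  # (n, m), rows sum to 1
    return attn @ V                     # (n, b) aggregated messages
```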

Figure 2: Qualitative results on the KITTI-UPP dataset (a) and the ScanNet dataset (b). For each dataset, the left figure shows the query point cloud and the right one shows our registration result on the unbalanced point cloud pair. Green lines indicate inlier correspondences and red lines indicate outlier correspondences.

Figure 3: Evaluation results on the KITTI and KITTI-UPP benchmarks with various spatial (ρ_s) and point (ρ_p) imbalance factors. The regions to the left of the gray lines indicate the balanced pairwise registration setting of the standard KITTI dataset, which previous pairwise registration methods mainly handle. For the experiments with various point imbalance factors (left), we fix the range value to 100 (ρ_s = 2.7) and change the hop value. For the experiments with various spatial imbalance factors (right), we fix the hop value to 25 (ρ_p = 6.9) and change the range value.

Figure 4: (a) Example of initial correspondences where green and red lines indicate inliers and outliers. (b) Spatial compatibility matrix. (c) Compatibility coherence matrix.

Following the recent literature (Choy et al., 2019b; Bai et al., 2020; Huang et al., 2021; Yu et al., 2021), we use three metrics to assess the registration performance of the methods:

Raguram et al., 2012; Hoseinnezhad & Bab-Hadiashar, 2011; Chum & Matas, 2005) are among the most popular methods. Recent studies apply learning-based outlier rejection to the problems of two-view correspondences (Moo Yi et al., 2018; Zhang et al., 2019; Brachmann et al., 2017) and 3D correspondences (Choy et al., 2020b;a). Lee et al. (2021) utilize Hough voting in the 6D parameter space with a learnable refinement module, resulting in a robust and efficient registration pipeline for large-scale point clouds. Bai et al. (2021) incorporate spatial compatibility to filter out noisy correspondences. Yu et al. (2021) propose to avoid keypoint detection by incorporating hierarchical correspondence extraction modules, and Lu et al. (2021) present a specialized pipeline for large-scale LiDAR scans. However, none of these methods is explicitly designed for unbalanced point cloud registration, and we empirically found that they collapse under extreme imbalance in spatial extent and point density. Global localization. The early global localization algorithms are based on 2D images (Chen et al., 2017; Sarlin et al., 2019; Sattler et al., 2017), where input images are matched against a 3D map reconstructed with Structure from Motion (SfM). They often cast visual localization as a retrieval problem, where the query images are described with global descriptors and the most similar point features are retrieved from the database. Vector of Locally Aggregated Descriptors (VLAD)

Neighborhood constraint. The similarity matrix S_{S_i} effectively conveys the geometric information with a moderate receptive field size. Following Lu et al. (2021), we incorporate additional geometric context, called the neighborhood constraint, to handle challenging cases. As described in Eq. 2, each point x ∈ X is assigned to its closest super point. For each super point x'_i ∈ X', we aggregate the features of the points within the partition G_i using max pooling, and denote the aggregated feature as the neighbor feature F_{G_i}. We then calculate another similarity matrix N_{S_i} from the inner products of the neighbor features, called the neighbor similarity matrix. The final similarity matrix is defined as the sum of the two similarity matrices, augmented with an additional row and column of slack entries to handle unmatched super points, as suggested in Yu et al. (2021); Sarlin et al. (2020).
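The slack augmentation can be sketched as padding the similarity matrix with one extra row and column (in Sarlin et al. (2020) the slack value is a learnable scalar; a constant is used here for illustration):

```python
import torch

def add_slack(S, slack_value=1.0):
    """Augment an (n, m) similarity matrix with one slack row and column so
    that unmatched super points can be assigned to them during optimal
    transport. slack_value stands in for the learnable scalar."""
    n, m = S.shape
    S_aug = torch.full((n + 1, m + 1), slack_value, dtype=S.dtype)
    S_aug[:n, :m] = S
    return S_aug
```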

Quantitative registration results on 3DMatch (Zeng et al., 2017), 3DLoMatch (Huang et al., 2021), and ScanNet (Dai et al., 2017).

Ablation study on (Left) k-nearest-neighbors, (Right) submap proposal and structured matching modules. The range value and hop value are set to 500 (ρ s = 11.1) and 25 (ρ p = 11.7).

Model runtime comparisons on the KITTI-UPP dataset. Range and hop values are the same as in Table 2.

Ablation study on compatibility matrix of score and similarity. The range value and hop value are set to 500 (ρ s = 11.1) and 25 (ρ p = 11.7).

Evaluation results on the KITTI-UPP benchmark under various point imbalance factors (ρ_p) and scale imbalance factors (ρ_s). We evaluate the methods by changing the hop value while fixing the range value to (a) 100 (ρ_s = 2.7) or (b) 200 (ρ_s = 4.7).


Table 6: Evaluation results on the KITTI-UPP benchmark under various point imbalance factors (ρ_p) and scale imbalance factors (ρ_s). We evaluate the methods by changing the hop value while fixing the range value to (a) 300 (ρ_s = 6.8) or (b) 400 (ρ_s = 9.0).

Registration results with ground-truth submaps. To validate the effectiveness of our proposed submap matching and structured matching modules, we analyze the registration performance of all methods when the ground-truth submap is provided. Note that this setting is advantageous for the baseline methods.

In Figure 8, we provide additional qualitative results of our method along with the baseline methods on the KITTI-UPP test dataset.

