A GENERAL RANK PRESERVING FRAMEWORK FOR ASYMMETRIC IMAGE RETRIEVAL

Abstract

Asymmetric image retrieval aims to deploy compatible models on platforms with different resources to balance computational efficiency and retrieval accuracy. The most critical issue is how to align the output features of different models. Despite great progress, existing approaches apply strong constraints so that features or neighbor structures are strictly aligned across different models. However, such a one-to-one constraint is too strict to be well preserved by query models with low capacity. Considering that the primary concern of users is the rank of the returned images, we propose a generic rank preserving framework, which achieves feature compatibility and order consistency between query and gallery models simultaneously. Specifically, we propose two alternatives to instantiate the framework. One realizes straightforward rank order preservation by directly preserving the consistency of the sorting results; to make the sorting process differentiable, the Heaviside step function used in sorting is approximated by the sigmoid function. The other preserves a learnable monotonic mapping relationship between the similarity scores returned by the query and gallery models, with the mapped similarity scores of the gallery model serving as pseudo-supervision to guide the query model training. Extensive experiments on various large-scale datasets demonstrate the superiority of our two proposed methods.

1. INTRODUCTION

In recent years, deep representation learning methods (Babenko et al., 2014; Tolias et al., 2016; 2020) have achieved great progress in image retrieval. Typically, most existing image retrieval tasks are symmetric: a single deep representation model is deployed to map both query and gallery images into the same discriminative feature space. During online retrieval, gallery images are ranked by sorting their feature distances, e.g., cosine similarity or Euclidean distance, against the query image. To achieve high retrieval accuracy, most existing methods deploy a large, powerful representation model. In a real-world visual search system, the gallery side usually runs on cloud-based platforms, which have sufficient resources to deploy large models. The query side, e.g., a mobile phone or smart camera, is too resource-constrained to deploy such large models. To strike a balance between performance and efficiency, it is better to deploy a lightweight model on the query side and a large one on the gallery side. This setup is denoted as asymmetric image retrieval (Duggal et al., 2021; Budnik & Avrithis, 2021). For asymmetric retrieval, the core problem is how to align the embedding spaces of the query and gallery models. To this end, BCT (Shen et al., 2020) first introduces feature compatibility learning. The concurrent work AML (Budnik & Avrithis, 2021) learns the query model by contrastive learning, with the gallery model extracting features for positive and negative samples. Recently, CSD (Wu et al., 2022b) achieves promising results by considering both first-order feature imitation and second-order neighbor similarity preservation during the learning of the query model. Despite this progress, existing methods enforce the consistency of features or neighbor structures across different models, which is too strict to be well preserved by lightweight query models with low capacity.
For users, the order of the returned images matters more than the image features or similarity scores. Strictly enforcing feature-level one-to-one consistency is therefore not the best route to better asymmetric retrieval accuracy, and rank preservation deserves more attention. To address the above issues, we propose a general rank preserving framework, which directly optimizes the consistency of rank order and thereby realizes feature compatibility across query and gallery models implicitly. Specifically, for a training image, we first extract its features with the query and gallery models, respectively. Then, the gallery feature is used for symmetric retrieval in a database whose images are also embedded by the gallery model, which returns a ranking list and similarity scores. We select the top K images in the ranking list and calculate their asymmetric similarity scores with the query feature. These asymmetric similarity scores may yield rank orders different from those returned by symmetric retrieval. Thus, two instantiation methods are proposed to achieve the consistency of these two rank orders. The first directly optimizes the rank order consistency of the sorting results. To make the sorting process differentiable, the sigmoid function is adopted to approximate the Heaviside step function (Davies, 1978), which is typically used for numerical comparison. The second maintains a learnable monotonic mapping relationship between the symmetric and asymmetric similarity scores. A learnable monotonically increasing function is applied to the similarity scores returned by symmetric retrieval, and its outputs serve as the pseudo-supervision of the query model. Then, we constrain the consistency between the mapped similarity scores and the asymmetric similarity scores to optimize the query model. Notably, both instantiations preserve the rank order of the images returned by symmetric retrieval.
Compared with previous methods, our framework has a unique advantage. It does not constrain the query model to mimic the features or the overall neighbor structures of the gallery model. Instead, it expects that the query model maintains the order of the returned images from symmetric retrieval. Thus, our framework weakens the restriction on the query model with low capacity, and reduces the risk of overfitting during learning the query model. Besides, our framework utilizes no annotation or label of the training data, which makes it flexible and adaptable in various real-world scenarios. To evaluate our approach, comprehensive experiments are conducted on four popular retrieval datasets. Ablations demonstrate the effectiveness and generalizability of our framework. Our approach surpasses the existing state-of-the-art methods by a considerable margin.

2. RELATED WORK

Image Retrieval. Given a large corpus, image retrieval aims to efficiently find the images that contain the same object or describe the same content as the queries, based on feature similarities. Most traditional image retrieval systems are based on local features (Lowe, 2004; Bay et al., 2006) and bag-of-words (Sivic & Zisserman, 2003; Philbin et al., 2007) representations borrowed from text retrieval. There are also several aggregation methods, including VLAD (Jégou et al., 2011), Fisher vectors (Perronnin et al., 2010) and ASMK (Tolias et al., 2013), which aggregate local features into compact representations for efficient search. Recently, with various pooling methods (Kalantidis et al., 2016; Tolias et al., 2016; Radenović et al., 2018b) and loss functions (Revaud et al., 2019; Deng et al., 2019; Weinzaepfel et al., 2022), deep learning has greatly improved the performance of image retrieval. Despite this progress, a large deep model is usually deployed for optimal performance, which is not applicable in some resource-constrained scenarios. In this work, we focus on asymmetric retrieval, where the query (user) side deploys a lightweight model while the gallery side applies a large model. Feature Compatibility. The core of asymmetric retrieval is to align the features of the query and gallery models, which is also known as feature compatibility. BCT (Shen et al., 2020) first formulates the problem of feature compatible learning and reuses the old classifier for the query model training. AML (Budnik & Avrithis, 2021) achieves feature compatibility by performing asymmetric contrastive learning between different models. After that, CSD (Wu et al., 2022b) preserves neighbor similarities in the embedding space of the gallery model in an unsupervised manner.
As for HVS (Duggal et al., 2021), both parameters and architecture are considered simultaneously in a unified framework, which gives promising performance. Other lines of research study the model regression problem (Yan et al., 2021; Zhang et al., 2022; Duggal et al., 2022) during gallery model updating, which is also related to feature compatibility. In this work, we start from the main concern of the users, i.e., the order of the returned images in the ranking list, and propose a general rank preserving framework that requires no annotations on the training datasets. Lightweight Network. Thanks to the evolution of network architectures (He et al., 2016; Tan & Le, 2021), deep convolutional neural networks (CNNs) achieve superior performance in various computer vision tasks. Usually, a larger model yields better performance at the cost of more storage and computation. Real-world tasks aim for the best accuracy within a limited computational budget, which is determined by the target platform. The demand for deploying high-performance deep models on resource-constrained platforms has led to a series of studies on model compression (Antonio et al., 2016; He et al., 2018b; Oktay et al., 2020) and efficient network architecture design, e.g., MobileNets (Howard et al., 2017; Sandler et al., 2018), ShuffleNets (Zhang et al., 2018; Ma et al., 2018) and EfficientNets (Tan & Le, 2019). In this work, we focus on asymmetric retrieval in resource-constrained scenarios. Since query features are extracted on resource-constrained end platforms, our approach employs the various lightweight models mentioned above as query models. Smooth Rank Approximation. There is a long history of designing smooth surrogates for rank approximation. In image retrieval, the well-known surrogates constrain the relative relationships between pairs (Raia et al., 2006) or triplets (Gordo et al., 2017), which implicitly leads to partial ranking.
Some methods utilize a smooth discretization of similarity scores (He et al., 2018a; Cakir et al., 2019; Ustinova & Lempitsky, 2016; Revaud et al., 2019) for rank approximation. Other approaches explicitly approximate the non-differentiable rank metric with a neural network (Engilberge et al., 2019) or a sum of sigmoid functions (Brown et al., 2020; Huang et al., 2022; Patel et al., 2022). Recently, a more accurate and robust approximation method was proposed in ROADMAP (Elias et al., 2021). However, all the methods mentioned above are designed for symmetric retrieval, where only a single model exists and no cross-model feature compatibility is considered; thus, they cannot be directly applied to asymmetric retrieval. In our approach, asymmetric and symmetric retrieval are performed with the same query, and the orders of the two returned ranking lists are constrained to be consistent, which also ensures feature compatibility.

3. BACKGROUND: ASYMMETRIC IMAGE RETRIEVAL

Given images of interest (referred to as the query set Q), image retrieval targets correctly finding images of the same content or object from a large-scale gallery set G. An image encoder φ(·) is deployed to map images into L2-normalized feature vectors, and the cosine similarity of two normalized vectors measures the similarity between query and gallery images. Usually, metrics such as mean Average Precision (mAP) are adopted to evaluate a retrieval system; they are conditioned on φ(·), Q and G. For convenience, we omit the query and gallery sets and denote the metric as M(φ_q(·), φ_g(·)), where φ_q(·) and φ_g(·) are the image encoders deployed for query and gallery feature extraction, respectively. In a conventional symmetric retrieval system, the same encoder, i.e., φ_g(·) = φ_q(·), is used on both sides. Typically, deploying a powerful model yields better retrieval accuracy, but this is not feasible on resource-constrained platforms, e.g., mobile devices or smart cameras. Assuming φ_q(·) is different from and significantly smaller than φ_g(·), asymmetric image retrieval leverages φ_q(·) to embed query images and φ_g(·) to embed gallery images. Thus, we must ensure that the lightweight query model φ_q(·) maps images into the same embedding space as the large gallery model φ_g(·). Besides, an asymmetric retrieval system is expected to achieve retrieval accuracy comparable to that of a symmetric one (Duggal et al., 2021):

M(φ_g(·), φ_g(·)) ≈ M(φ_q(·), φ_g(·)) > M(φ_q(·), φ_q(·)).  (1)
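The two retrieval protocols above can be sketched as follows. This is a minimal illustration, with fixed random linear projections standing in for the trained encoders; the names `phi_g`, `phi_q`, `W_g` and `W_q` are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Rows on the unit sphere: the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the encoders: fixed random projections.
# phi_g plays the large gallery model, phi_q the lightweight query model.
d_in, d = 32, 8
W_g = rng.normal(size=(d_in, d))
W_q = rng.normal(size=(d_in, d))
phi_g = lambda x: l2_normalize(x @ W_g)
phi_q = lambda x: l2_normalize(x @ W_q)

gallery = rng.normal(size=(100, d_in))
query = rng.normal(size=(1, d_in))

# Gallery images are always embedded by the large model.
F = phi_g(gallery)                                    # (100, d)

# Symmetric retrieval: the query is embedded by phi_g as well.
rank_sym = np.argsort(-(phi_g(query) @ F.T).ravel())

# Asymmetric retrieval: the same query embedded by phi_q.
# Without compatibility training, this ranking can differ arbitrarily.
rank_asym = np.argsort(-(phi_q(query) @ F.T).ravel())
```

Eq. (1) asks that the ranking induced by `phi_q(query)` score images against `F` almost as well as the one induced by `phi_g(query)`.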

4. RANK PRESERVING FRAMEWORK

An overview of our framework is shown in Fig. 1: a training image x is encoded by the gallery model φ_g(·) and the query model φ_q(·) into features g and q, respectively; g is treated as the query to search a training gallery set G_t, whose images are also embedded by φ_g(·); we fetch the features F_K of the top-K images in the ranking list and calculate their asymmetric similarity scores S_q with q; two instantiations, Ranking Order Preservation (Sec. 4.1) and Monotonic Similarity Preservation (Sec. 4.2), then enforce the consistency of the rank orders obtained when the query is embedded by φ_q(·) and φ_g(·), respectively.

Given a well-trained gallery model φ_g(·), we aim to learn a lightweight query model φ_q(·) that is compatible with it. Assume there exists a training gallery set G_t. We first embed it into F = [f_g^1, f_g^2, ..., f_g^N] ∈ R^{N×d} with φ_g(·):

f_g^i = φ_g(g_i) ∈ R^d,  i = 1, 2, ..., N,

where g_i is the i-th image in G_t. During the learning of φ_q(·), the gallery model is fixed. For each training image x, we extract its features q and g with φ_q(·) and φ_g(·), respectively:

q = φ_q(x) ∈ R^d,  g = φ_g(x) ∈ R^d.  (3)

Then, we perform symmetric retrieval in G_t with g as the query and obtain the ranking list R = [r_1, r_2, ..., r_K] and the similarity scores S_g = [g^T f_g^{r_1}, g^T f_g^{r_2}, ..., g^T f_g^{r_K}] ∈ R^K of the top-K images, where r_i denotes the ID of the i-th image in G_t. Notably, the values in S_g are monotonically decreasing, i.e., g^T f_g^{r_1} ≥ g^T f_g^{r_2} ≥ ... ≥ g^T f_g^{r_K}. With the ranking list R, the corresponding feature embeddings F_K = [f_g^{r_1}, f_g^{r_2}, ..., f_g^{r_K}] of the top-K images are taken from F. Then, we calculate the asymmetric similarity scores between the query feature q and F_K:

S_q = [q^T f_g^{r_1}, q^T f_g^{r_2}, ..., q^T f_g^{r_K}] ∈ R^K.  (4)

In CSD (Wu et al., 2022b), the consistency of S_q and S_g is directly enforced to optimize the query model. However, we argue that this constraint is too strict to be preserved well by a lightweight model with low capacity. Besides, the user experience is mainly influenced by the rank order of the returned images rather than the specific similarity scores, so it is better to impose the constraint directly on the rank order. Specifically, we need to ensure that the values in S_q are also monotonically decreasing: q^T f_g^{r_1} ≥ q^T f_g^{r_2} ≥ ... ≥ q^T f_g^{r_K}. To this end, we propose two methods to instantiate the rank preserving framework. One achieves straightforward Rank Order Preservation (ROP) by constraining the consistency of the sorting results, which are formulated as indicator matrices in Sec. 4.1. It is not feasible to optimize the sorting results directly due to the non-differentiable Heaviside step function (Davies, 1978) used for numerical comparison in sorting; thus, the sigmoid function is introduced as a smooth approximation, enabling optimization of the sorting process with back-propagation. The other is Monotonic Similarity Preservation (MSP), which preserves a learnable monotonic mapping relationship between the returned similarity scores S_q and S_g. Specifically, we learn a monotonically increasing function that predicts S_q given S_g, and then constrain the consistency between the mapped S_g and S_q to train the query model.
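The data-preparation pipeline above can be sketched with random unit vectors standing in for real features (the random data is purely illustrative; `F`, `g`, `q`, `R`, `S_g`, `F_K`, `S_q` follow the definitions in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, K = 500, 16, 8

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

F = l2n(rng.normal(size=(N, d)))   # training gallery embedded by phi_g
g = l2n(rng.normal(size=d))        # gallery-model feature of image x
q = l2n(rng.normal(size=d))        # query-model feature of the same x

# Symmetric retrieval: rank the training gallery with g, keep the top-K.
sims = F @ g
R = np.argsort(-sims)[:K]          # ranking list (IDs of the top-K images)
S_g = sims[R]                      # symmetric scores, sorted descending
F_K = F[R]                         # features of the top-K images

# Asymmetric scores of the same K images against the query feature q.
# The framework trains phi_q so that S_q is also sorted descending.
S_q = F_K @ q
```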

4.1. RANK ORDER PRESERVATION

The motivation of our method is to directly preserve the rank order of the returned images between asymmetric and symmetric retrieval. Thus, the critical problem is to select a suitable constraint. In this section, we dive into the sorting process and constrain the sorting results of symmetric and asymmetric retrieval to be consistent.

Sorting process. Sorting the similarity scores between the query and database images is an essential operation to obtain the final ranking list in image retrieval, and numerical comparison is the most fundamental operation in sorting algorithms. In this work, we adopt comparison sorting. Assume there exists a list of similarity scores S = [s_1, s_2, ..., s_N] ∈ R^N to be sorted. We first compute the difference matrix D ∈ R^{N×N} between all pairs of elements in S, i.e., D_{i,j} = s_j − s_i. Then, the binary indicator matrix I ∈ R^{N×N} is obtained by applying the Heaviside step function H(x), which equals 0 for negative values and 1 otherwise, to D:

I_{i,j} = H(s_j − s_i).  (6)

Each element I_{i,j} of the indicator matrix encodes the relative relationship between the similarity scores s_i and s_j: I_{i,j} = 1 if s_j is larger than or equal to s_i, and I_{i,j} = 0 otherwise.

Indicator matrix consistency. Since the indicator matrix captures the relative relationships of similarity scores, we take it as the constraint for optimizing the query model: if the indicator matrix is well preserved, asymmetric retrieval yields the same ranking list as symmetric retrieval. Specifically, we substitute S_g and S_q into Eq. (6) to obtain the corresponding indicator matrices I_g ∈ R^{K×K} and I_q ∈ R^{K×K}, whose entries are H(∆_g^{j,i}) and H(∆_q^{j,i}) with ∆_l^{m,n} = S_l^m − S_l^n, l ∈ {q, g}. Since S_g is sorted, I_g is a fixed lower-triangular matrix of ones. Generally, if a gallery image has a higher similarity score against the query image, it is more likely to be a true positive and deserves more attention during the training of the query model. To this end, we design a ranking weight

W_i = exp(S_g^i / τ_r) / (i · Σ_{l=1}^K exp(S_g^l / τ_r)),  i = 1, 2, ..., K,

where τ_r is a temperature. Finally, the weighted indicator matrix consistency loss is minimized as the objective function to train the query model:

L_ROP = Σ_{i=1}^K W_i ||I_g^{i,:} − I_q^{i,:}||_2^2 = Σ_{i=1}^K Σ_{j=1}^K W_i (H(∆_g^{i,j}) − H(∆_q^{i,j}))^2.  (8)

Heaviside step function approximation. Unfortunately, the derivative of the Heaviside step function H(x) is the Dirac delta function δ(x): it is either flat with zero gradient or discontinuous, and hence cannot be optimized with gradient-based methods. Inspired by (Brown et al., 2020), the sigmoid function σ(x, τ) = 1 / (1 + e^{−x/τ}), where the temperature τ adjusts the sharpness, is used as a smooth approximation of the Heaviside step function. As shown in Fig. 2, the temperature governs the approximation tightness and the gradient-effective interval. Substituting σ(x, τ) into Eq. (8), the weighted indicator matrix consistency loss is approximated as:

L_ROP = Σ_{i=1}^K Σ_{j=1}^K W_i (H(∆_g^{i,j}) − σ(∆_q^{i,j}, τ))^2.  (9)
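The smoothed ROP loss can be sketched numerically as follows. This is a minimal NumPy version for illustration; a real implementation would use an autodiff framework so that gradients flow through the sigmoid term into the query model.

```python
import numpy as np

def rop_loss(S_g, S_q, tau=0.1, tau_r=0.1):
    """Sketch of the weighted indicator-consistency loss.

    S_g: symmetric scores of the top-K list (sorted descending, fixed).
    S_q: asymmetric scores of the same K images (to be optimized).
    """
    K = len(S_g)
    # Pairwise differences: d[i, j] = S[i] - S[j].
    d_g = S_g[:, None] - S_g[None, :]
    d_q = S_q[:, None] - S_q[None, :]
    # Hard Heaviside indicator on the (fixed) gallery side ...
    H_g = (d_g >= 0).astype(float)
    # ... and its smooth sigmoid surrogate on the query side.
    sig_q = 1.0 / (1.0 + np.exp(-d_q / tau))
    # Ranking weight: softmax over S_g, discounted by list position i.
    w = np.exp(S_g / tau_r)
    w = w / (np.arange(1, K + 1) * w.sum())
    return float(np.sum(w[:, None] * (H_g - sig_q) ** 2))
```

When `S_q` follows the same order as `S_g` the loss is small; reversing the order inflates it, which is exactly the signal used to train the query model.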

4.2. MONOTONIC SIMILARITY PRESERVATION

In this section, we introduce another method to preserve the rank order. In Fig. 3, we visualize the distributions of the similarity score pairs (s_q, s_g) produced by existing methods. We observe that the same s_g may correspond to a wide range of values of s_q; in other words, the similarity score pairs lie on wide strips, which means that existing methods do not preserve well the order of the images returned by symmetric retrieval. We attribute this to the fact that existing methods all impose a strict one-to-one constraint, which may cause overfitting for a query model with low capacity. Therefore, we introduce a learnable monotonic mapping function f(x) applied to S_g to form the pseudo-supervision of S_q. Formally, M_g^i = f(S_g^i), i = 1, 2, ..., K. This avoids the strict neighbor structure alignment between the query and gallery models. After that, the Kullback-Leibler (KL) divergence, which shows excellent performance in CSD (Wu et al., 2022b), between M_g and S_q is adopted as the final objective to optimize the query model. Specifically, we first convert M_g and S_q into probability distributions:

p_g^i = exp(M_g^i / τ_g) / Σ_{l=1}^K exp(M_g^l / τ_g),  p_q^i = exp(S_q^i / τ_q) / Σ_{l=1}^K exp(S_q^l / τ_q),  i = 1, 2, ..., K,  (10)

where τ_q and τ_g are temperature coefficients. Both are set less than 1 to keep φ_q(·) focused on the top images of the ranking list. Then, the monotonic similarity preservation loss is defined as

L_MSP = KL(p_g || p_q) = Σ_{i=1}^K p_g^i log(p_g^i / p_q^i).  (11)

To preserve the order of the elements in S_g, we must ensure that f(x) is monotonically increasing. In this work, we consider three common families of monotonically increasing functions, which are discussed in the following.
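The MSP objective (softmax both score lists, then KL divergence) can be sketched as follows, with the identity mapping standing in for the learnable monotone f introduced next; all names here are illustrative:

```python
import numpy as np

def softmax(x, tau):
    z = np.exp((x - x.max()) / tau)   # shift for numerical stability
    return z / z.sum()

def msp_loss(S_g, S_q, f=lambda x: x, tau_g=0.2, tau_q=0.2):
    """KL(p_g || p_q) between the mapped symmetric scores and the
    asymmetric scores, sketched with NumPy."""
    M_g = f(S_g)                      # pseudo-supervision for S_q
    p_g = softmax(M_g, tau_g)
    p_q = softmax(S_q, tau_q)
    return float(np.sum(p_g * np.log(p_g / p_q)))
```

With `f(x) = x` and equal temperatures this reduces to the CSD form: the loss is zero when `S_q` matches the pseudo-supervision and positive otherwise.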

Logarithmic function.

Considering that the domain of logarithmic functions is (0, +∞) and the cosine similarity lies between −1 and 1, we define the mapping function as f(x) = log_a(x + 1), with −1 < x < 1 and a > 1, where a is a learnable parameter. Exponential function. Another common monotonically increasing function is the exponential function. To keep each M_g^i in a range less than 1, f(x) is defined as f(x) = a^{x−1}, with −1 < x < 1 and a > 1, where a is also a learnable parameter. Polynomial function. There is a wide variety of polynomial combinations. In this work, we consider a simple case: we first choose a set of basis functions X = {x^α, x^{2α}, ..., x^{Nα}}, and define the mapping function as a linear combination of those basis functions. Formally, f(x) = Σ_{i=1}^N a_i x^{iα}, with −1 < x < 1 and α > 0, where {a_1, a_2, ..., a_N} is a set of learnable parameters. To ensure that f(x) is monotonically increasing, each a_i should be greater than 0. Besides, we also make the sum of a_i equal to 1 to control the range of f(x). Relation with contextual similarity distillation. In CSD (Wu et al., 2022b), the KL loss strictly restricts the consistency between S_g and S_q; when the loss minimum is achieved, a linear relationship τ_q S_g = τ_g S_q holds. In MSP, when the mapping function is set to f(x) = x (the diagonal line in Fig. 3), M_g equals S_g and the loss L_MSP degrades to the form used in CSD. Thus, CSD is a special case of our monotonic similarity preservation. As shown in Fig. 3, our MSP method achieves a narrower striped similarity distribution, since it introduces a learnable monotonically increasing mapping between symmetric and asymmetric similarities. As a result, MSP preserves the order of the returned images better than CSD.
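The three families can be sketched as follows. The parameter values (`a = 2`, the polynomial coefficients, `alpha = 1`) are illustrative placeholders for quantities the paper learns during training:

```python
import numpy as np

def f_log(x, a=2.0):
    # f(x) = log_a(x + 1); monotone on (-1, 1) for a > 1.
    return np.log(x + 1.0) / np.log(a)

def f_exp(x, a=2.0):
    # f(x) = a^(x - 1); monotone and bounded above by 1 on (-1, 1).
    return a ** (x - 1.0)

def f_poly(x, coeffs=(0.5, 0.3, 0.2), alpha=1):
    # f(x) = sum_i a_i * x^(i * alpha), the constraints a_i > 0 and
    # sum(a_i) = 1 being those the paper places on the coefficients.
    # Integer alpha keeps x^(i*alpha) well defined for negative x.
    assert all(c > 0 for c in coeffs)
    assert abs(sum(coeffs) - 1.0) < 1e-9
    return sum(c * x ** ((i + 1) * alpha) for i, c in enumerate(coeffs))
```

All three map the similarity interval (−1, 1) through a strictly increasing curve, so sorting the mapped scores M_g reproduces the original symmetric ranking.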

5. EXPERIMENTS

5.1. IMPLEMENTATION DETAILS

Training datasets. Two datasets are used for training. One is SfM-120k (Radenović et al., 2018b), of which 551 3D models are taken for training while the other 162 3D models are used for validation. The other is GLDv2 (Weyand et al., 2020), which consists of 1,580,470 images of 81,311 classes. We randomly sample 80% of the images from GLDv2 for training and leave the remaining 20% for validation. Evaluation datasets and metrics. We evaluate the trained query models on four datasets under the asymmetric retrieval setting: GLDv2-Test (Weyand et al., 2020), Revisited Oxford with R1M distractors (ROxf + R1M), Revisited Paris with R1M distractors (RPar + R1M) (Radenović et al., 2018a) and INSTRE (Wang & Jiang, 2015). The evaluation metric for GLDv2-Test is mAP@100, while mAP is reported on all other datasets. See App. A.1 for more detailed descriptions of the testing sets. Architectures. Following the settings in CSD (Wu et al., 2022b), we choose ResNet101 (He et al., 2016) trained with GeM (Radenović et al., 2018b) and DELG (Cao et al., 2020) as the gallery models. For the query model, common lightweight models, e.g., ShuffleNets (Ma et al., 2018), MobileNets (Sandler et al., 2018) and EfficientNets (Tan & Le, 2019), are chosen. To adapt the lightweight models to the image retrieval task, we adjust their architectures slightly; the details, along with computation and parameter complexity statistics, are presented in App. A.1.

5.2. ANALYSIS AND ABLATIONS

In this section, we analyze the proposed rank preserving framework and perform exhaustive ablations. R101 and Mv2 denote ResNet101 and MobileNetV2, respectively. "Ours" and "Ours*" denote the lightweight query model φ_q(·) trained with the ROP and MSP constraints, respectively.

Length K of the ranking list. In Fig. 4, we compare our proposed methods against CSD for ranking lists of different lengths K ∈ {64, 256, 512, 2048, 4096, 8192}. As the length increases, the performance improves but saturates at K = 4096, after which it decreases. When K is small, the query model only needs to focus on the elements at the front of the ranking list, without taking full advantage of the order information in the list, so the performance is unsatisfactory. On the contrary, when K is particularly large, the query model must attend to a very wide range of elements in the ranking list. Images at the bottom of the ranking list are far from the query, and their relative orders have almost no effect on the overall retrieval accuracy; constraining the query model to preserve the order of this part degrades performance.

Heaviside step function approximation. In Tab. 1, we investigate the effect of the temperature τ, which governs the smoothness of the sigmoid function. Results show that τ = 0.1 leads to the best performance across datasets. As explained in Sec. 4.1, a smaller value of τ yields a narrower gradient-effective interval (Fig. 2) and a tighter approximation to the Heaviside step function. The strong amplification of the gradient around zero encourages moving instances in the embedding space, leading to a change of rank. However, an excessively small τ causes gradient vanishing, which is harmful to the optimization of the query model. In contrast, a large value of τ provides a wide gradient-effective interval at the cost of a looser approximation to the true order.

[Fig. 5 caption fragment: ... (2) CVNet (Lee et al., 2022) based on global pooling. (3) Token (Wu et al., 2022a) based on local feature aggregation. MobileNetV2 is adopted as the query model for all settings. Two on the right: GeM serves as the gallery model, and four lightweight models, including ShuffleNetV2 0.5× (Sv2 0.5×), ShuffleNetV2 (Sv2), EfficientNetB0 (Eb0) and EfficientNetB1 (Eb1), are chosen as query models. AML: (Budnik & Avrithis, 2021); HVS: (Duggal et al., 2021); LCE: (Meng et al., 2021).]

Rank weight W in Eq. (9). Tab. 2 shows the impact of different ranking weights W. Using no weight, i.e., W_i = 1, leads to unsatisfactory results, while the best results are obtained when both the similarity score S_g^i and the ranking position i are considered simultaneously. Since the images at the bottom of the ranking list are more likely to be dissimilar from the queries, keeping the order among them wastes the representation capacity of the query model; it should focus more on the order of samples at the top of the ranking list.

Various query and gallery models. In Fig. 5, we study the adaptation to different model architectures: models with different architectures are adopted as gallery and query models. See App. A.1 for the parameter counts and computational complexity of the different models. Our methods outperform CSD in all settings, demonstrating the superiority of rank preservation over strict neighbor structure alignment.

Training with gallery images from the testing sets. In Tab. 4, compared with the two other unsupervised algorithms, Reg and CSD, our methods achieve the best performance. Besides, it shows that, when evaluating on a test set, the model trained with the corresponding gallery images leads to better performance. Except for the experiments presented in Tab. 4, no experiment in this paper uses gallery images from the testing sets during training.

5.3. COMPARISON TO THE STATE-OF-THE-ART METHODS

We compare our method with state-of-the-art methods in Tab. 5 and Tab. 6. First, we observe that methods based on feature imitation, e.g., Reg, give inferior asymmetric retrieval performance. CSD achieves better results, owing to taking feature preservation and neighbor structure alignment into consideration simultaneously. Our methods maintain the order of the returned images in the ranking list, which is directly related to user experience, and achieve the best performance on all evaluation datasets under the asymmetric setting. When trained with GeM as the gallery model and SfM-120k as the training set, our framework outperforms the best previous method, CSD, in mAP by 4.09% and 2.47% on ROxf + R1M and by 3.16% and 2.25% on RPar + R1M, under the Medium and Hard protocols, respectively. The results on INSTRE and GLDv2-Test also confirm the superiority of our methods.

6. CONCLUSIONS

In this paper, we present a general rank preserving framework for asymmetric image retrieval. Different from strict feature imitation and neighbor structure alignment, we focus on preserving the order of the returned images in the ranking list when the same query is used for symmetric and asymmetric retrieval, respectively. To this end, we devise two instantiations. One directly constrains the consistency of the sorting results; to make sorting differentiable, the sigmoid function is introduced as a smooth approximation of the non-differentiable Heaviside step function used in sorting. The other preserves a monotonic relationship between the similarity scores returned by symmetric and asymmetric retrieval: it applies a learnable monotonically increasing function to the similarity scores of symmetric retrieval, whose outputs serve as the targets of the asymmetric similarity scores. The proposed framework requires no annotations or labels during training, and shows broad applicability and strong generalizability in our extensive experiments.

A IMPLEMENTATION DETAILS

[...] is pre-trained in the embedding space of the gallery model and remains fixed during the training of the query model. In Tab. 7, we show the number of parameters and the computational complexity (in FLOPs) of the different networks, all modified for the retrieval task, for an input image of size 362 × 362. Compared with large models, lightweight models significantly reduce the computation during the inference phase and can thus be used in various resource-constrained scenarios. Testing details. During testing, for the ROxf and RPar datasets, we resize images so that the larger dimension equals 1024 pixels, preserving the aspect ratio. Besides, image features are extracted at three scales, i.e., {1/√2, 1, √2}. We L2-normalize the features of each scale independently, average them over the three scales, and apply another L2 normalization. Under the asymmetric retrieval setting, queries are embedded with the lightweight query model φ_q(·), while the gallery images are embedded by the large model φ_g(·).
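The multi-scale test-time protocol described above can be sketched as follows; `extract` is a hypothetical single-scale feature extractor, not an API from the paper:

```python
import numpy as np

def l2n(v):
    return v / np.linalg.norm(v)

def multiscale_descriptor(extract, image,
                          scales=(1 / np.sqrt(2), 1.0, np.sqrt(2))):
    """Extract a feature at each scale, L2-normalize per scale,
    average across scales, then L2-normalize the result."""
    feats = [l2n(extract(image, s)) for s in scales]
    return l2n(np.mean(feats, axis=0))

# Toy usage with a dummy extractor that just rescales a fixed vector.
desc = multiscale_descriptor(lambda img, s: img * s + s, np.arange(4.0))
```

The final normalization guarantees a unit-norm descriptor regardless of the extractor, so cosine similarity remains a plain dot product.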

B EXTENDED COMPARISONS

B.1 DIFFERENT QUERY MODEL

In this section, we perform more extensive comparisons with existing methods. Specifically, in Tab. 8, we take EfficientNetB3 as the lightweight query model. The performance of asymmetric retrieval improves as the capacity of the query model becomes larger (compare with the accuracy of MobileNetV2 in Tab. 6). Our proposed methods achieve the best performance under various settings, with the accuracy of asymmetric retrieval almost comparable to that of symmetric retrieval.

B.2 DIFFERENT GALLERY MODEL

In Tab. 9, we further extend the experiments of Fig. 5 in the main paper. Three recent deep representation models are adopted to embed the gallery images: (1) Token (Wu et al., 2022a), based on local feature aggregation; (2) CVNet (Lee et al., 2022), based on global pooling; (3) DOLG (Yang et al., 2021), based on local and global feature fusion.

C ADDITIONAL ABLATIONS

C.1 IMPACT OF DISTANCE TYPE

In Rank Order Preservation (Sec. 4.1), the weighted mean square error is taken as the final objective function to train the query model end to end. In this section, we explore an alternative for measuring the distance between the indicator matrices I^g and I^q. Specifically, we convert each row of I^g and I^q into the form of a probability distribution:

$$P^g_{i,j} = \frac{\exp(I^g_{i,j}/\tau_g)}{\sum_{l=1}^{K}\exp(I^g_{i,l}/\tau_g)}, \quad P^q_{i,j} = \frac{\exp(I^q_{i,j}/\tau_q)}{\sum_{l=1}^{K}\exp(I^q_{i,l}/\tau_q)}, \quad i = 1, 2, \cdots, K,$$

where τ_g and τ_q are temperatures. Then, the distance between the two indicator matrices is defined as the KL divergence of the two distributions:

$$\mathcal{L}_{\mathrm{ROP}} = \sum_{i=1}^{K}\sum_{j=1}^{K} W_i \, P^g_{i,j} \log\frac{P^g_{i,j}}{P^q_{i,j}}.$$

As for Monotonic Similarity Preservation (Sec. 4.2), we try the L2 distance to measure the inconsistency between the mapped similarity scores M^g and the asymmetric similarity scores S^q. Then, Eq. (11) is defined as

$$\mathcal{L}_{\mathrm{MSP}} = \| M^g - S^q \|_2^2 = \sum_{i=1}^{K} (M^g_i - S^q_i)^2.$$

(Figure 7 caption, continued: MobileNetV2 and GeM are adopted as query and gallery models, respectively. Query features are extracted at the original single scale. The x-axis represents the average FLOPs (G) for five inferences, which is proportional to the size of the image. We resize the queries to {0.2, 0.4, 0.6, 1/√2, 0.8, 1.0, √2} of the original size (1024 × 768).)
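As a rough illustration of these two alternative distances (a sketch, not the authors' implementation), the following computes the row-wise KL divergence between indicator matrices and the L2 distance between score lists; matrices are plain Python lists of rows, and all names are assumptions:

```python
import math

def row_softmax(row, tau):
    """Turn one row of an indicator matrix into a probability distribution."""
    exps = [math.exp(v / tau) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def rop_kl_loss(I_g, I_q, W, tau_g=0.1, tau_q=0.1):
    """Ranking-weighted KL divergence between row distributions of I^g and I^q."""
    loss = 0.0
    for w, rg, rq in zip(W, I_g, I_q):
        P_g = row_softmax(rg, tau_g)
        P_q = row_softmax(rq, tau_q)
        loss += w * sum(pg * math.log(pg / pq) for pg, pq in zip(P_g, P_q))
    return loss

def msp_l2_loss(M_g, S_q):
    """L2 distance between mapped gallery scores M^g and asymmetric scores S^q."""
    return sum((m - s) ** 2 for m, s in zip(M_g, S_q))
```

When the two indicator matrices coincide (and the temperatures match), the KL term vanishes, which is the intended optimum.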

D ANALYSIS AND DISCUSSIONS

D.1 INFERENCE COMPUTATION

During testing, we follow the common settings and use multi-scale feature extraction, which greatly aggravates the inference computation of the query model. The inference computation is mainly related to the complexity of the model and the size of the test image. In this section, MobileNetV2 is adopted as the query model. We use single-scale feature extraction and vary the image size to study the relationship between inference computation and asymmetric retrieval accuracy. As shown in Fig. 7, the asymmetric retrieval accuracy increases and then saturates as the inference computation increases. In a practical application scenario, one should choose an appropriate test image size to strike a balance between efficiency and accuracy.

MobileNetV2 is adopted as the query model.

D.2 COMBINE TWO INSTANTIATIONS

In this section, we try to combine the two instantiation methods proposed in the main paper. As shown in Tab. 15, the simple combination fails to bring further performance improvement. We believe this is because the optimization goals of both methods are to maintain the order of the images in the returned ranking list. Thus, the results obtained with the two methods are not complementary to each other, which is also confirmed by the distributions of the similarity scores presented in Fig. 3 and Fig. 8.

D.3 SIMILARITY SCORE DISTRIBUTION

In this section, we visualize the similarity score distributions in Fig. 8, where the query model is trained by different methods. The first row shows the training process of Rank Order Preservation. The following three rows correspond to the training process of Monotonic Similarity Preservation with three different mapping functions, respectively. As training proceeds, the region of the similarity distributions gradually becomes slender, which indicates that the orders of the images in the returned ranking list are better maintained. Under our rank preserving framework, the different instantiations yield similar results, which is also confirmed by the final retrieval accuracy (Tab. 6).

(d) φ_q(·) is trained for MSP (Eq. (11)) with the learnable logarithmic mapping function. Figure 8: Visualization of the similarity score distributions. SfM-120k, MobileNetV2 and GeM are adopted as the training dataset, query and gallery models, respectively. s_g: symmetric similarity score; s_q: asymmetric similarity score.



Figure 1: An overview of the rank preserving framework. Given a training image x, the gallery model φ_g(·) and the query model φ_q(·) encode it into features g and q, respectively. g is treated as the query to search a training gallery set G_t, whose images are also embedded by φ_g(·). We fetch the features F_K of the top-K images in the ranking list and calculate the asymmetric similarity scores S_q with q. Two instantiations of our framework, Rank Order Preservation (Sec. 4.1) and Monotonic Similarity Preservation (Sec. 4.2), are proposed to ensure the consistency of rank orders when the query is embedded by φ_q(·) and φ_g(·), respectively.

Figure 2: (Left) Heaviside step function and two sigmoid functions with different temperatures τ as approximations. (Right) Corresponding derivatives.
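The relaxation illustrated in Figure 2 can be written down directly; this is a minimal sketch, with `tau` playing the role of the temperature τ:

```python
import math

def heaviside(x):
    """Non-differentiable step function used in exact rank counting."""
    return 1.0 if x > 0 else 0.0

def smooth_step(x, tau):
    """Sigmoid relaxation of the Heaviside step; smaller tau -> sharper step."""
    return 1.0 / (1.0 + math.exp(-x / tau))

def smooth_step_grad(x, tau):
    """Derivative of the relaxation: a bump of width ~tau around zero."""
    s = smooth_step(x, tau)
    return s * (1.0 - s) / tau
```

A soft rank then follows naturally: the rank of item i in a score list s can be approximated by the sum of `smooth_step(s[j] - s[i], tau)` over all j, which is differentiable in the scores.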


Figure 7: Performance versus inference computation when varying the size of the input image.

Analysis of the length of the ranking list K in our methods. SfM-120k, GeM and MobileNetV2 are adopted as the training set, gallery and query models, respectively.

Analysis of the temperature τ in Eq. (9). τ controls the smoothness of the sigmoid function used to approximate the Heaviside step function.

Analysis of the ranking weight W_i in

Comparison of different unsupervised methods trained on the deployed gallery set. †: denotes the gallery set of that testing dataset. MobileNetV2 and GeM are adopted as query and gallery models, respectively. Reg: (Budnik & Avrithis, 2021); CSD: (Wu et al., 2022b).

Comparison to the state-of-the-art methods on INSTRE and GLDv2-Test. Contr*:

Analysis of different mapping functions. SfM-120k, MobileNetV2 and GeM are adopted as the training set, query and gallery models, respectively.

Training with DELG as the gallery model and GLDv2 as the training set.

mAP comparison (asymmetric retrieval) to the state-of-the-art methods. †: the same gallery models as the comparison methods; R101: ResNet101; Mv2: MobileNetV2; black bold: the best performance. See App. B for more comparisons.


mAP (asymmetric retrieval) comparison to the state-of-the-art methods. GeM and DELG are trained with SfM-120k and GLDv2, respectively. †: gallery models are the same as those of the comparison methods; R101: ResNet101; Eb3: EfficientNetB3. Black bold denotes the best performance.

Our methods achieve the best performance across various settings, which demonstrates their generalization.

Extended mAP (asymmetric retrieval) comparison to the state-of-the-art methods with different gallery models. All models are trained with GLDv2. †: re-evaluated from the official public weights; ‡: our re-implementation. R101: ResNet101; Mv2: MobileNetV2. Black bold denotes the best performance.

Ablation on the combination of different training losses (Eq. (9) + λ2 · Eq. (11)). SfM-120k and GLDv2 are adopted for training the query model when GeM and DELG serve as the gallery model, respectively.

φ_q(·) is trained for ROP (Eq. (9)).

7. ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China under Contract 62102128 and 62021001, and in part by the Fundamental Research Funds for the Central Universities under contract WK3490000007. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.

Appendix

In this appendix, we first present the pseudo-code of our methods and more details about training, testing and model architectures (App. A). Then, we conduct more comparisons (App. B) to demonstrate the superiority of our methods in various settings. Finally, more extended ablations (App. C) and discussions (App. D) are provided for an in-depth understanding of our framework.

Algorithm 1: Pseudo-code of the Rank Preserving Framework in a PyTorch-like style.
# gallery_model: well-trained and fixed encoder for the gallery set, no gradient
# query_model: lightweight query encoder, with gradient
# topk: the length of the ranking list, K in Eq. (5)
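Complementing the pseudo-code of Algorithm 1, the following is a hedged sketch of one step of the framework in plain Python. `gallery_model`, `query_model` and `gallery_feats` mirror the names in Algorithm 1, but both models are stand-ins returning L2-normalized feature lists, and the loss computed on S_q/S_g is omitted:

```python
def dot(a, b):
    """Cosine similarity for already L2-normalized feature lists."""
    return sum(x * y for x, y in zip(a, b))

def train_step(x, gallery_model, query_model, gallery_feats, topk):
    """One step of the rank preserving framework (illustrative sketch).

    Returns the asymmetric (S_q) and symmetric (S_g) top-K similarity
    scores; a rank preserving loss would then align S_q with S_g.
    """
    g = gallery_model(x)   # fixed encoder; no gradient in practice
    q = query_model(x)     # lightweight encoder being trained
    # symmetric retrieval: rank the training gallery by similarity to g
    ranked = sorted(((dot(g, f), f) for f in gallery_feats),
                    key=lambda t: -t[0])[:topk]
    S_g = [s for s, _ in ranked]          # symmetric scores (targets)
    F_K = [f for _, f in ranked]          # features of the top-K images
    S_q = [dot(q, f) for f in F_K]        # asymmetric scores to be aligned
    return S_q, S_g
```

With identical query and gallery encoders the two score lists coincide, which is the fixed point the framework drives the lightweight model toward.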

A.1 EXPERIMENT SETTINGS

Training Details. When SfM-120k is adopted for training, we follow the common settings in AML (Budnik & Avrithis, 2021). Training images are resized with the maximum side equal to 512 pixels, keeping the aspect ratio. No data augmentation is adopted. The number of training epochs and the batch size are set to 10 and 64, respectively. The query model is trained on one NVIDIA RTX 3090 GPU. When using GLDv2 as the training set, 512 × 512 pixels are center-cropped from the randomly resized image. Random color jittering is adopted as the data augmentation. We train the query model on 4 NVIDIA RTX 3090 GPUs for 10 epochs with a batch size of 256. All models are optimized using Adam with an initial learning rate of 10^{-3} and a weight decay of 10^{-6}. A linearly decaying scheduler gradually decays the learning rate to 0 when the desired number of steps is reached. When the query model is trained with Rank Order Preservation (Sec. 4.1), the length K of the ranking list is set to 4096, and the temperature coefficient τ_r in the ranking weight W_i is set to 0.2. As for Monotonic Similarity Preservation (Sec. 4.2), K is also set to 4096, and both τ_g and τ_q are set to 0.1.

Model Architecture Details. In Fig. 6, we show the typical differences between model architectures for the classification task and the image retrieval task. To obtain better transfer performance, most models are pre-trained on ImageNet (Deng et al., 2009) with category labels. These models usually consist of a feature extractor, a global mean pooling, several fully-connected layers, and a classification layer. To adapt them to the image retrieval task, we keep only the feature extractor and discard the other layers. After that, GeM pooling (Radenović et al., 2018b) is adopted to aggregate the feature map output by the feature extractor. Finally, a whitening layer, implemented as a fully-connected layer, is adopted to obtain the final global feature.
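For concreteness, GeM pooling can be sketched as follows. This is a simplified scalar version, assuming each channel of the feature map is given as a flat list of activations; the exponent p (learnable in practice) and the clamping constant eps follow the usual formulation:

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over spatial positions, per channel.

    p = 1 recovers average pooling; large p approaches max pooling.
    """
    pooled = []
    for channel in feature_map:
        clamped = [max(v, eps) for v in channel]       # avoid 0**(1/p) issues
        mean_p = sum(v ** p for v in clamped) / len(clamped)
        pooled.append(mean_p ** (1.0 / p))
    return pooled
```

The pooled vector would then pass through the whitening (fully-connected) layer described above to produce the final global descriptor.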
The polynomial function is adopted as the mapping function; MobileNetV2 and GeM are adopted as the query and gallery models, respectively. The comparison of different distance types is summarized in Tab. 10. The performance of the various loss types is similar, which shows that rank preservation, rather than a specific consistency loss, is the key to superior performance.

C.2 IMPACT OF THE UPDATE FUNCTION

In Tab. 11 and Tab. 12, we show an extended version of Tab. 3 in the main paper, where the specific parameter values are given. For the logarithmic and exponential functions, we also consider manually defining the bases of the mapping functions. As shown in Tab. 11, both learnable and manually defined bases lead to similar results. For the polynomial functions, we consider various settings of the basis functions, i.e., {x^{1/(2α)}}_{α=1}^{6}, {x^{1/(3α)}}_{α=1}^{9}, {x^{1/(4α)}}_{α=1}^{12}. Note that the parameters {a_1, a_2, ..., a_N} in this case are all learned from the data distribution. It can be seen that superior performance is achieved with various settings of the basis functions when the order of the basis functions is properly chosen.
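A minimal sketch of such a polynomial mapping, assuming similarity scores lie in [0, 1] and the learned coefficients {a_α} are kept non-negative so that every basis term x^{1/(2α)} is monotonically increasing (the constraint mechanism here is an assumption for illustration):

```python
def monotonic_poly_map(s, coeffs):
    """Monotonic mapping M(s) = sum_a coeffs[a] * s**(1/(2*(a+1))).

    With non-negative coefficients, each basis function is increasing on
    [0, 1], so the mapping preserves the order of similarity scores.
    """
    assert all(c >= 0 for c in coeffs), "non-negativity keeps M monotonic"
    return sum(c * s ** (1.0 / (2 * (a + 1))) for a, c in enumerate(coeffs))
```

Because the mapping is strictly increasing, using M(s_g) as the target for the asymmetric scores constrains their order, not their exact values, which is exactly the relaxation the framework is after.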

C.3 EXTENDED ABLATION ON TRAINING DATASETS

In this section, we study the scalability of our framework. We randomly sample different numbers of images from the GLDv2 dataset for training. As shown in Tab. 13, more training data leads to better performance across all settings. In Tab. 13, we also conduct more experiments with the unlabeled gallery set for training. When evaluating on a deployed database, training on it is always better than training on a separately collected dataset, which is also confirmed by Tab. 14 in the main paper.

