A GENERAL RANK PRESERVING FRAMEWORK FOR ASYMMETRIC IMAGE RETRIEVAL

Abstract

Asymmetric image retrieval aims to deploy compatible models on platforms with different resources to achieve a balance between computational efficiency and retrieval accuracy. The most critical issue is how to align the output features of different models. Despite great progress, existing approaches apply strong constraints so that features or neighbor structures are strictly aligned across different models. However, such a one-to-one constraint is too strict to be well preserved by query models with low capacity. Considering that the primary concern of users is the rank of the returned images, we propose a generic rank preserving framework, which achieves feature compatibility and order consistency between query and gallery models simultaneously. Specifically, we propose two alternatives to instantiate the framework. One realizes straightforward rank order preservation by directly preserving the consistency of the sorting results. To make the sorting process differentiable, the Heaviside step function in sorting is approximated by the sigmoid function. The other aims to preserve a learnable monotonic mapping relationship between the similarity scores returned by the query and gallery models. The mapped similarity scores of the gallery model serve as pseudo-supervision to guide the training of the query model. Extensive experiments on various large-scale datasets demonstrate the superiority of our two proposed methods.

1. INTRODUCTION

In recent years, deep representation learning methods (Babenko et al., 2014; Tolias et al., 2016; 2020) have achieved great progress in image retrieval. Most existing image retrieval tasks are symmetric: a single deep representation model maps both query and gallery images into the same discriminative feature space. During online retrieval, gallery images are ranked by their feature similarity or distance, e.g., cosine similarity or Euclidean distance, to the query image. To achieve high retrieval accuracy, most existing methods deploy a large, powerful representation model. In a real-world visual search system, the gallery side usually runs on cloud-based platforms, which have sufficient resources for such models. The query side, e.g., a mobile phone or smart camera, is too resource-constrained to host a large model. To strike a balance between performance and efficiency, it is better to deploy a lightweight model on the query side and a large one on the gallery side. This setup is denoted as asymmetric image retrieval (Duggal et al., 2021; Budnik & Avrithis, 2021).
For asymmetric retrieval, the core problem is how to align the embedding spaces of the query and gallery models. To this end, BCT (Shen et al., 2020) first introduces feature compatibility learning. Concurrent work AML (Budnik & Avrithis, 2021) learns the query model by contrastive learning, with the gallery model extracting features for positive and negative samples. Recently, CSD (Wu et al., 2022b) achieves promising results by considering both first-order feature imitation and second-order neighbor similarity preservation when learning the query model. Despite the great progress, existing methods enforce the consistency of features or neighbor structures across different models, which is too strict to be well preserved by lightweight query models with low capacity.
For users, the order of the returned images plays a more important role than the image features or similarity scores. Strictly enforcing feature-level one-to-one consistency is therefore not the best route to higher asymmetric retrieval accuracy; rank preservation deserves more attention.
To address the above issues, we propose a general rank preserving framework, which directly optimizes the consistency of rank order and thereby realizes feature compatibility across query and gallery models implicitly. Specifically, for a training image, we first extract its features with the query and gallery models, respectively. Then, the gallery feature is used for symmetric retrieval in a database whose images are also embedded by the gallery model, which returns a ranking list and the corresponding similarity scores. We select the top K images in the ranking list and calculate their asymmetric similarity scores with the query feature. These asymmetric similarity scores may yield a rank order different from the one returned by symmetric retrieval. Thus, two instantiations are proposed to make these two rank orders consistent. The first directly optimizes the rank order consistency of the sorting results. To make the sorting process differentiable, the sigmoid function is adopted to approximate the Heaviside step function (Davies, 1978), which is typically used for numerical comparison. The second maintains a learnable monotonic mapping relationship between the symmetric and asymmetric similarity scores. A learnable monotonically increasing function is applied to the similarity scores returned by symmetric retrieval, and its output serves as pseudo-supervision for the query model. Then, we constrain the consistency between the mapped similarity scores and the asymmetric similarity scores to optimize the query model. Notably, both instantiations preserve the rank order of the images returned by symmetric retrieval.
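The sigmoid relaxation behind the first instantiation can be illustrated with a minimal sketch. It assumes the common formulation in which the rank of an item equals one plus the number of items with a higher score, so that each Heaviside comparison can be replaced by a sigmoid with temperature `tau`. The function names, the value of `tau`, the toy scores and the squared-gap penalty `rank_gap` are all hypothetical choices made for illustration and are not taken from the framework described here.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function: sigmoid(x) = 0.5 * (1 + tanh(x / 2)).
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def hard_rank(scores):
    """Exact rank of each item (1 = most similar), built from Heaviside comparisons."""
    s = np.asarray(scores, dtype=float)
    diff = s[None, :] - s[:, None]          # diff[i, j] = s_j - s_i
    return 1.0 + (diff > 0).sum(axis=1)     # count items scoring strictly higher

def soft_rank(scores, tau=0.01):
    """Differentiable surrogate: H(s_j - s_i) becomes sigmoid((s_j - s_i) / tau)."""
    s = np.asarray(scores, dtype=float)
    diff = s[None, :] - s[:, None]
    soft_step = sigmoid(diff / tau)
    np.fill_diagonal(soft_step, 0.0)        # an item is never compared with itself
    return 1.0 + soft_step.sum(axis=1)

def rank_gap(sym_scores, asym_scores, tau=0.01):
    """Hypothetical penalty: mean squared gap between the symmetric (target) ranks
    and the soft ranks of the asymmetric scores over the same top-K images."""
    target = hard_rank(sym_scores)
    return float(np.mean((soft_rank(asym_scores, tau) - target) ** 2))

# Toy top-K = 4 list: scores from the gallery model (symmetric retrieval)
# and from the query model (asymmetric retrieval) for the same four images.
sym = [0.92, 0.85, 0.60, 0.41]
asym_same_order = [0.70, 0.65, 0.30, 0.10]   # different values, same order
asym_swapped = [0.65, 0.70, 0.30, 0.10]      # first two images swapped

print(rank_gap(sym, asym_same_order))
print(rank_gap(sym, asym_swapped))
```

In this toy case the penalty stays close to zero when the asymmetric scores differ in value but keep the symmetric order, and it grows once two images swap places, which is the behavior a rank-level constraint is meant to capture. A smaller `tau` brings the soft rank closer to the exact rank at the cost of sharper, less informative gradients.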
Compared with previous methods, our framework has a unique advantage. It does not constrain the query model to mimic the features or the overall neighbor structures of the gallery model. Instead, it only expects the query model to maintain the order of the images returned by symmetric retrieval. Thus, our framework weakens the restriction on the low-capacity query model and reduces the risk of overfitting when learning it. Besides, our framework uses no annotations or labels of the training data, which makes it flexible and adaptable in various real-world scenarios. To evaluate our approach, comprehensive experiments are conducted on four popular retrieval datasets. Ablations demonstrate the effectiveness and generalizability of our framework. Our approach surpasses the existing state-of-the-art methods by a considerable margin.

2. RELATED WORK

Image Retrieval. Given a large corpus, image retrieval aims to efficiently find images that contain the same object or depict the same content as the queries, based on their feature similarities. Most traditional image retrieval systems are based on local features (Lowe, 2004; Bay et al., 2006) and bag-of-words representations (Sivic & Zisserman, 2003; Philbin et al., 2007) borrowed from text retrieval. There are also several aggregation methods, including VLAD (Jégou et al., 2011), Fisher vectors (Perronnin et al., 2010) and ASMK (Tolias et al., 2013), which aggregate local features into compact representations for efficient search. Recently, with the introduction of various pooling methods (Kalantidis et al., 2016; Tolias et al., 2016; Radenović et al., 2018b) and loss functions (Revaud et al., 2019; Deng et al., 2019; Weinzaepfel et al., 2022), deep learning has greatly improved the performance of image retrieval. Despite the great progress, a large deep model is usually deployed for optimal performance, which is not applicable in some resource-constrained scenarios. In this work, we focus on asymmetric retrieval, where the query (user) side deploys a lightweight model while the gallery side applies a large model.
Feature Compatibility. The core of asymmetric retrieval is to align the features of the query and gallery models, which is also known as feature compatibility. BCT (Shen et al., 2020) first formulates the problem of feature compatible learning and reuses the old classifier for the query model training. AML (Budnik & Avrithis, 2021) achieves feature compatibility by performing asymmetric contrastive learning between different models. After that, CSD (Wu et al., 2022b) preserves the neighbor similarities in the embedding space of the gallery model in an unsupervised manner. In HVS (Duggal et al., 2021), both parameters and architecture are considered simultaneously in a unified framework, which gives promising performance. Other lines of research follow the model regression problem (Yan et al., 2021; Zhang et al., 2022; Duggal et al., 2022) during gallery model updating, which is also related to feature compatibility. In this work, we start from

