ACTIVE IMAGE INDEXING

Abstract

Image copy detection and retrieval from large databases leverage two components. First, a neural network maps an image to a vector representation that is relatively robust to various transformations of the image. Second, an efficient but approximate similarity search algorithm trades scalability (size and speed) against quality of the search, thereby introducing a source of error. This paper improves the robustness of image copy detection with active indexing, which optimizes the interplay of these two components. We reduce the quantization loss of a given image representation by making imperceptible changes to the image before its release. The loss is back-propagated through the deep neural network to the image, under perceptual constraints. These modifications make the image more retrievable. Our experiments show that the retrieval and copy detection of activated images is significantly improved. For instance, activation improves the Recall@1 by +40% on various image transformations, and for several popular indexing structures based on product quantization and locality-sensitive hashing.

1. INTRODUCTION

The traceability of images on a media-sharing platform is a challenge: they are widely used, easily edited, and disseminated both inside and outside the platform. In this paper, we tackle the corresponding task of Image Copy Detection (ICD), i.e. finding whether an image already exists in the database and, if so, returning its identifier. ICD methods power reverse search engines, photography service providers checking copyrights, and media platforms moderating and tracking down malicious content (e.g. Microsoft's PhotoDNA (2009) or Apple's NeuralHash (2021)). Image identification systems have to be robust enough to identify images that are edited (cropping, colorimetric changes, JPEG compression, etc.) after their release (Douze et al., 2021; Wang et al., 2022).

The common approach for content-based image retrieval reduces images to high-dimensional vectors, referred to as representations. Early representations used for retrieval were hand-crafted features such as color histograms (Swain & Ballard, 1991), GIST (Oliva & Torralba, 2001), or Fisher Vectors (Perronnin et al., 2010). More recently, a large body of work on self-supervised learning has focused on producing discriminative representations with deep neural networks, which has inspired recent ICD systems. In fact, all submissions to the NeurIPS 2021 Image Similarity Challenge (Papakipos et al., 2022) exploit neural networks. They are trained to provide invariance to potential image transformations, akin to data augmentation in self-supervised learning.

Scalability is another key requirement of image similarity search: searching must be fast on large-scale databases, which rules out exhaustive vector comparisons. In practice, ICD engines leverage approximate nearest-neighbor search algorithms that trade search accuracy against scalability. Approximate similarity search algorithms speed up the search by not computing the exact distance between all representations in the dataset (Johnson et al., 2019; Guo et al., 2020).
First, they lower the number of scored items by partitioning the representation space and evaluating the distances for only a few subsets. Second, they reduce the computational cost of each similarity evaluation with quantization or binarization. These mechanisms make indexing methods subject to the curse of dimensionality. In particular, in high-dimensional spaces, vector representations lie close to the boundaries of the partition (Böhm et al., 2001). Since edited versions of an original image have noisy vector representations, they sometimes fall into different subsets or are not quantized the same way by the index. All in all, this makes approximate similarity search very sensitive to perturbations of the edited image representations, which causes images to evade detection.

In this paper, we introduce a method that improves similarity search on large databases, provided that the platform or photo provider can modify the images before their release (see Fig. 1). We put the popular saying "attack is the best form of defense" into practice by applying image perturbations and drawing inspiration from adversarial attacks. Indeed, representations produced with neural networks are subject to adversarial examples (Szegedy et al., 2013): small perturbations of the input image can lead to very different vector representations, making it possible to create adversarial queries that fool image retrieval systems (Liu et al., 2019; Tolias et al., 2019; Dolhansky & Ferrer, 2020). In contrast, we modify an image to make it more indexing-friendly. With minimal changes in the image domain, the image representation is pushed towards the center of its indexing partition, raising the odds that edited versions will remain in the same subset. This property is obtained by minimizing an indexation loss by gradient descent back to the image pixels, as for adversarial examples.
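The effect described above can be illustrated with a toy numpy simulation: random centroids stand in for the coarse partition of an index, noisy copies of a descriptor stand in for edited versions, and "activation" is crudely approximated by moving the vector halfway toward its cell centroid (all names and parameters here are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cells = 64, 32

# Random centroids stand in for the coarse k-means partition of an index.
centroids = rng.standard_normal((n_cells, d))

def assign(v):
    """Cell id of v = index of its nearest centroid."""
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))

def flip_rate(v, sigma=1.0, trials=500, seed=1):
    """Fraction of noisy copies of v (simulated edits) landing in another cell."""
    noise_rng = np.random.default_rng(seed)
    cell = assign(v)
    flips = sum(assign(v + sigma * noise_rng.standard_normal(d)) != cell
                for _ in range(trials))
    return flips / trials

x = rng.standard_normal(d)            # stand-in for the descriptor f(I_o)
c = centroids[assign(x)]              # centroid of its cell
x_active = x + 0.5 * (c - x)          # "activated": pushed toward the centroid

print(flip_rate(x), flip_rate(x_active))   # the activated vector typically flips far less
```

Pulling the representation away from the partition boundaries increases its margin to every competing cell, so the same amount of edit noise causes far fewer cell changes.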
For indexing structures based on product quantization (Jegou et al., 2010), this strategy amounts to pushing the representation closer to its quantized codeword, in which case the indexation loss is simply measured by the reconstruction error. Since image quality is an important constraint here, the perturbation is shaped by perceptual filters to remain invisible to the human eye. Our contributions are:
• a new approach to improve ICD and retrieval when images can be changed before release;
• an adversarial image optimization scheme that adds minimal perceptual perturbations to images in order to reduce reconstruction errors and improve vector representations for indexing;
• experimental evidence that the method significantly improves index performance.

2. PRELIMINARIES: REPRESENTATION LEARNING AND INDEXING

For the sake of simplicity, the exposition focuses on image representations from SSCD networks (Pizzi et al., 2022) and the indexing technique IVF-PQ (Jegou et al., 2010), since both are typically used for ICD. Extensions to other methods can be found in Sec. 5.4.

2.1. DEEP DESCRIPTOR LEARNING

Metric embedding learning aims to learn a mapping f : R^(c×h×w) → R^d such that measuring the similarity between images I and I′ amounts to computing the distance ∥f(I) − f(I′)∥. In recent works, f is typically a neural network trained with self-supervision on raw data to learn metrically meaningful representations. Methods include contrastive learning (Chen et al., 2020), self-distillation (Grill et al., 2020; Caron et al., 2021), or masking random patches of images (He et al., 2022; Assran et al., 2022). In particular, SSCD (Pizzi et al., 2022) is a training method specialized for ICD. It employs the contrastive self-supervised method SimCLR (Chen et al., 2020) and entropy regularization (Sablayrolles et al., 2019) to improve the distribution of the representations.
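The contrastive objective used by SimCLR (and hence by SSCD) can be sketched in numpy as follows; `nt_xent` is our name for the normalized-temperature cross-entropy loss, and the toy vectors stand in for network outputs on two augmented views of each image:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """SimCLR-style contrastive loss over a batch of paired views.

    z1[i] and z2[i] are embeddings of two augmentations of image i;
    every other embedding in the 2N-sized batch acts as a negative.
    """
    z = np.concatenate([z1, z2])
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                  # a view is not its own negative
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # index of the positive
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 32))
print(nt_xent(z, z + 0.01 * rng.standard_normal((8, 32))))   # low loss: views agree
```

Minimizing this loss pulls the two views of the same image together and pushes all other pairs apart, which is what provides the transformation invariance needed for ICD.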



Figure 1: Overview of the method and latent-space representation. We start from an original image Io that can be edited by transformations t(·) in various ways: its feature extraction f(t(Io)) spans the shaded region in the embedding space. The edited versions should be recoverable by nearest-neighbor search on quantized representations. In the regular (non-active) case, f(Io) is quantized by the index; when the image is edited, f(t(Io)) switches cells and the closest neighbor returned by the index is the wrong one. In active indexing, Io is modified in an imperceptible way to generate I⋆ such that f(I⋆) lies further away from the cell boundary. When edited copies f(t(I⋆)) are queried, retrieval errors are significantly reduced.
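In the IVF-PQ setting of Figure 1, the indexation loss minimized by active indexing reduces to the product-quantization reconstruction error ∥f(I) − q(f(I))∥². A minimal numpy sketch of this loss (random stand-in codebooks and hypothetical names; a real index learns the codebooks with k-means per sub-space):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ksub = 64, 8, 256      # descriptor dim, # sub-vectors, codewords per sub-space
dsub = d // m

# Random stand-in codebooks; real PQ codebooks are trained on data.
codebooks = rng.standard_normal((m, ksub, dsub))

def pq_quantize(x):
    """Return q(x): each sub-vector of x replaced by its nearest codeword."""
    q = np.empty_like(x)
    for j in range(m):
        sub = x[j * dsub:(j + 1) * dsub]
        idx = int(np.argmin(np.linalg.norm(codebooks[j] - sub, axis=1)))
        q[j * dsub:(j + 1) * dsub] = codebooks[j][idx]
    return q

x = rng.standard_normal(d)                       # stand-in for f(I)
loss = float(np.sum((x - pq_quantize(x)) ** 2))  # indexation loss ||x - q(x)||^2
print(loss)
```

Active indexing back-propagates this loss through f down to the pixels of I, under perceptual constraints; here we only evaluate it on a fixed vector.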

