APPROXIMATE NEAREST NEIGHBOR SEARCH THROUGH MODERN ERROR-CORRECTING CODES

Abstract

A locality-sensitive hash (or LSH) is a function that can efficiently map dataset points into a latent space while preserving pairwise distances. Such LSH functions have been used in approximate nearest-neighbor search (ANNS) in the following classic way, which we call classic hash clustering (CHC): first, the dataset points are hashed into a low-dimensional binary space using the LSH function; then, the points are clustered by these hash values. Upon receiving a query, its nearest neighbors are sought within its hash-cluster and nearby hash-clusters (i.e., multiprobe). However, CHC mandates a low-dimensional latent space for the LSH function, which distorts distances from the (high-dimensional) original real space; this results in inferior recall. This is often mitigated by using multiple hash tables, at additional storage and memory cost. In this paper, we introduce a better way of using LSH functions for ANNS. Our method, called the Polar Code Nearest-Neighbor (PCNN) algorithm, uses modern error-correcting codes (specifically polar codes) to maintain a manageable number of clusters inside a high-dimensional latent space. Allowing the LSH function to embed into this high-dimensional latent space results in higher recall, as the embedding faithfully captures distances in the original space. The crux of PCNN is using polar codes for probing: we present a multi-probe scheme for PCNN which uses efficient list-decoding methods for polar codes, with time complexity independent of the dataset size. For a fixed choice of LSH function, experimental results demonstrate significant performance gains of PCNN over CHC; in particular, PCNN with a single table outperforms CHC with multiple tables, obviating the need for large memory and storage.

1. INTRODUCTION

In similarity search, one is first given a dataset D of points, then a set of query points from the same space. For each query, the goal is to find the closest point (or points) in D to that query, according to some given metric. The simplest way to find these nearest neighbors of a query is to calculate the distance of the query from each point in the dataset D; however, when the dataset D is large, this linear cost in the dataset's size is prohibitive. Thus, upon query, one would like to consider only a small subset of D. Since these non-exhaustive algorithms do not consider all points in the dataset, we are interested in approximate similarity search, with the following relaxations:

• We allow an approximation ratio, i.e., the algorithm is allowed to return neighbors whose distance to the query is at most some factor α ≥ 1 times the distance of the nearest neighbor to the query.¹
• Since the algorithm does not explore all points, it can sometimes return a result which is not within the desired distance; the fraction of results which are within the desired range is called the recall of the algorithm.

Clustering Methods. A common technique for approximate similarity search is to divide the dataset D into clusters. Then, upon receiving a query, the algorithm searches only the points of D that appear in the clusters closest to the query. (If the clusters are represented by points in the original space, the distance to the query is well defined; otherwise, a different metric is needed.) Algorithms based on such clustering are often used in practice, since they allow storing different clusters on different machines (i.e., sharding) for efficient distributed processing of queries. In light of these benefits of clustering methods, we would like to study them in our approximate setting. However, existing clustering methods have some drawbacks in this setting, which motivate the algorithm presented in this paper.
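To make the two relaxations above concrete, the following sketch computes exact nearest neighbors by exhaustive search and then measures recall under an approximation ratio α. The function names (`exact_nn`, `recall_at_alpha`) are illustrative, not from the paper:

```python
import numpy as np

def exact_nn(dataset, query):
    """Exhaustive nearest-neighbor search: O(|D|) distance computations."""
    dists = np.linalg.norm(dataset - query, axis=1)
    return int(np.argmin(dists))

def recall_at_alpha(dataset, queries, candidates, alpha=1.0):
    """Fraction of queries whose returned candidate lies within alpha times
    the true nearest-neighbor distance (the notion of recall used here)."""
    hits = 0
    for q, c in zip(queries, candidates):
        best = np.linalg.norm(dataset[exact_nn(dataset, q)] - q)
        if np.linalg.norm(dataset[c] - q) <= alpha * best:
            hits += 1
    return hits / len(queries)
```

A non-exhaustive algorithm supplies the `candidates`; with α = 1 only the true nearest neighbor counts as a hit, while α > 1 also accepts sufficiently close neighbors.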
We now explore two popular clustering methods and describe their drawbacks.

Clustering by Training Cluster Centers. In this method, introduced by Sivic & Zisserman (2003), cluster centers are trained on the dataset (or some sample of the dataset) using a clustering algorithm (namely k-means), and each dataset point is mapped to a cluster (e.g., by the nearest cluster center). Upon query, the clusters corresponding to the closest cluster centers are searched. This common algorithm is part of the popular Faiss library (Johnson et al., 2021) as the default inverted-file (IVF) method; we henceforth refer to this method as IVF. Since the cluster centers are unstructured, finding these closest centers requires a number of distance computations which is linear in the number of centers. The total number of distance computations (for both centers and dataset points) is therefore always at least the square root of the dataset's size. Thus, while popular in general, this method is less appropriate for an approximate regime on a large dataset, as we would like to attain high recall at a much smaller computational cost, independent of the dataset size.

Clustering by Locality-Sensitive Hashing. Another such clustering method uses Locality-Sensitive Hashing, or LSH (Indyk & Motwani, 1998). In this method, a locality-sensitive hash h : R^d → {0,1}^nbit is used to map the dataset to hash-codes (nbit-bit strings). Each such hash-code identifies a cluster which contains the dataset points in the hash-code's preimage. Upon receiving a query q, the hash-code h(q) is calculated, and the nearest neighbors are searched for only within the clusters identified by the hash-codes closest to h(q) (Lv et al., 2007). We refer to this simple method, which is used in most classic LSH papers, as Classic Hash Clustering, or CHC for short.
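A minimal sketch of CHC, instantiated with hyperplane LSH (Charikar, 2002) for concreteness: points are hashed by the signs of random projections, clustered by hash-code, and a query probes its own bucket plus all buckets at Hamming distance 1 (a simple form of multiprobe). The function names and parameter choices below are illustrative, not from the paper:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def hyperplane_lsh(points, planes):
    """Hyperplane LSH: one bit per random hyperplane (sign of projection)."""
    return (points @ planes.T > 0).astype(np.uint8)

def build_chc_index(dataset, nbit=8, dim=32):
    """Cluster the dataset by nbit-bit hash-codes (one bucket per code)."""
    planes = rng.standard_normal((nbit, dim))
    codes = hyperplane_lsh(dataset, planes)
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[code.tobytes()].append(i)
    return planes, buckets

def chc_query(q, dataset, planes, buckets, probe_flips=1):
    """Probe the query's bucket and all buckets within Hamming distance 1."""
    code = hyperplane_lsh(q[None, :], planes)[0]
    cand = list(buckets.get(code.tobytes(), []))
    if probe_flips >= 1:
        for b in range(len(code)):
            flipped = code.copy()
            flipped[b] ^= 1
            cand += buckets.get(flipped.tobytes(), [])
    if not cand:
        return None
    dists = np.linalg.norm(dataset[cand] - q, axis=1)
    return cand[int(np.argmin(dists))]
```

Note that, unlike IVF, finding the probed buckets requires no distance computations against cluster centers; only the final candidate list is ranked by true distance.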
Note that the choice of the LSH function h is independent of the operation of CHC, and h should be chosen such that distances in the embedding space approximate distances in the original space; two possible choices (which we also consider in this paper) are hyperplane LSH (Charikar, 2002) and the data-dependent autoencoder LSH (Tissier et al., 2019). The fact that CHC uses LSH functions provides some advantages. First, the closest clusters to a query can be found without calculating its distance to every cluster; this makes CHC more suitable for the approximate regime than IVF. Second, the index in CHC can easily be augmented with additional dataset points in an online fashion, as the clustering is not trained on the dataset. In addition, CHC usually has a low memory/storage footprint. However, CHC does not usually achieve high recall; this is usually alleviated by using multiple tables (i.e., multiple clusterings), at significant memory and storage costs.

Why does CHC achieve low recall? A possible explanation is that the distances between the dataset/query points in the high-dimensional space R^d are not faithfully captured by the hashing into the space {0,1}^nbit. This is because the hash-code space must be low-dimensional: memory and running-time restrictions make it infeasible to use a large nbit. For example, if one chooses nbit = 40 (and thus 2^40 clusters) for a dataset of a billion points, 99.9% of the clusters would be empty; thus, nbit must remain small, and usually does not exceed 32. However, this severely limits the granularity of distances, as Hamming distances in this low-dimensional binary space take on only one of at most 32 nonzero values. This lack of granularity is a property of every low-dimensional embedding, and thus appears in all LSH functions (including data-dependent ones). In addition, using a low number of embedding bits can yield a high variance in the distance between any embedded pair of points.
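The variance argument can be quantified: when each hash bit differs independently with probability p (as in hyperplane LSH, where p equals the relative angular distance), the relative Hamming distance is Binomial(nbit, p)/nbit, so its standard deviation shrinks only as 1/sqrt(nbit). The value p = 0.25 below is an illustrative choice:

```python
import math

def relative_hamming_std(nbit, p=0.25):
    """Std of the relative Hamming distance when each of the nbit hash bits
    independently differs with probability p (Binomial(nbit, p) / nbit)."""
    return math.sqrt(p * (1 - p) / nbit)

# The deviation from the expected distance shrinks as 1/sqrt(nbit),
# so a long code (e.g., 512 bits) separates distances far more reliably
# than the short codes (<= 32 bits) that CHC can afford.
for nbit in (16, 32, 512):
    print(nbit, relative_hamming_std(nbit))
```

At nbit = 32 the standard deviation is roughly 0.077, on the same order as the expected distance gaps one is trying to resolve; at nbit = 512 it drops to about 0.019.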
For example, consider hyperplane LSH: in this method, the expected relative Hamming distance of the hash-codes is equal to the relative angular distance between the original points (Charikar, 2002), but the bits of the hash-code are generated independently. Thus, the deviations from the expectation are very significant when the number of bits is low, as mandated by CHC. These drawbacks of CHC thus call for a different technique, one able to simultaneously utilize distance information from a high-dimensional binary embedding and preserve a reasonable number of clusters.

Our Contributions. In this paper, we present a generalization of CHC which uses modern error-correcting codes (ECCs); we call this method the Polar Code Nearest Neighbor algorithm (or PCNN). PCNN encapsulates any LSH method H, similarly to CHC, but yields superior performance. CHC uses H to embed into an nbit-dimensional binary space for some low nbit (e.g., nbit = 30); PCNN instead uses H to embed into a cdim-dimensional binary space, for some cdim ≫ nbit (e.g., cdim = 512). Then, the embedded dataset in PCNN is clustered, such that the set of clusters forms an nbit-dimensional subspace inside the larger cdim-dimensional embedding space. Upon query, the
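The clustering structure just described, where the 2^nbit cluster IDs form a linear-code subspace of {0,1}^cdim, can be illustrated with a toy random systematic code and brute-force nearest-codeword decoding. This sketch is not the paper's method: PCNN uses polar codes precisely so that this decoding step is efficient, whereas the brute-force search below is only feasible for tiny nbit:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
nbit, cdim = 6, 32  # toy sizes; the paper's regime is e.g. cdim = 512

# Systematic generator matrix G = [I | random] over GF(2): its row space is
# an nbit-dimensional subspace of {0,1}^cdim with 2^nbit distinct codewords.
G = np.concatenate(
    [np.eye(nbit, dtype=np.uint8),
     rng.integers(0, 2, size=(nbit, cdim - nbit), dtype=np.uint8)],
    axis=1,
)
messages = np.array(list(product([0, 1], repeat=nbit)), dtype=np.uint8)
codebook = (messages @ G) % 2  # all 2^nbit codewords (= cluster IDs)

def cluster_of(embedding):
    """Assign a cdim-bit embedding to its nearest codeword by brute force
    (stand-in for efficient polar decoding)."""
    dists = np.count_nonzero(codebook != embedding, axis=1)
    return int(np.argmin(dists))
```

Every cdim-bit embedding thus maps to one of only 2^nbit clusters, while distances are still measured in the full cdim-dimensional space.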



¹ For similarity measures, which should be maximized, we would instead be interested in α ≤ 1.

