BENCHMARKING APPROXIMATE K-NEAREST NEIGHBOUR SEARCH FOR BIG HIGH DIMENSIONAL DYNAMIC DATA

Abstract

Approximate k-Nearest Neighbour (ANN) methods are commonly used for mining information from big high-dimensional datasets. For each application, the high-level dataset properties and run-time requirements determine which method will provide the most suitable tradeoffs. However, due to a significant lack of comprehensive benchmarking, judicious method selection is not currently possible for ANN applications that involve frequent online changes to datasets. Here we address this issue by building upon existing benchmarks for static search problems to provide a new benchmarking framework for big high-dimensional dynamic data. We apply our framework to dynamic scenarios modelled after common real-world applications. In all cases we are able to identify a suitable recall-runtime tradeoff that improves upon a worst-case exhaustive search. Our framework provides a flexible solution to accelerate future ANN research and to enable researchers in other online data-rich domains to find suitable methods for handling their ANN searches.¹

1. INTRODUCTION

Approximate k-Nearest Neighbour (ANN) search is a widely applicable technique for tractably computing local statistics over large datasets of high-dimensional discrete samples (Beyer et al., 1999). ANN methods achieve sub-linear search times by trading off search accuracy against runtime, and are applied in many domains such as image retrieval, robotic localisation, cross-modal search and other semantic searches (Prokhorenkova & Shekhovtsov, 2020). ANN search is well suited to applications where an index structure can be precomputed over a static dataset to then provide a suitable recall-runtime tradeoff. Achieving an optimal tradeoff for specific dataset properties and application requirements relies on hyperparameter tuning for each candidate ANN method. For instance, graph-based indexes can be tuned to achieve high search accuracy, while quantisation methods are better suited to performing faster searches with less exact results.

In practice, when given a new dataset, there is a significant computational cost to evaluating and selecting the best-performing ANN method. Several ANN benchmarks have been established to guide the selection and parameter tuning required for achieving tractable searches (Aumüller et al., 2017; Matsui, 2020). However, current ANN benchmarks focus on static search problems and cannot be used to determine whether any ANN methods are suitable for tackling dynamic search problems, where the indexed dataset changes over time. We observe that current ANN benchmarks do not generalise to dynamic search problems because they perform index construction as an offline process that optimises for search performance on a fixed dataset (Figure 1a). This fails to address the requirements of growing fields such as Machine Learning (ML), where there is a strong need for tractable k-nearest neighbour search on large dynamic sets of high-dimensional samples (Prokhorenkova & Shekhovtsov, 2020).
For example, local statistics can be extracted by computing neighbourhoods in the embedding space of a learning process, but computing these neighbourhoods requires frequent evaluation of sample locations in the highly dynamic embedding. Due to the lack of suitable ANN benchmarks, achieving tractable search performance currently requires extensive evaluation and tuning on the already computationally expensive and highly parameterised systems that could utilise dynamic ANN search. Here, we address this gap and present a novel ANN benchmarking framework for dynamic search problems. Unlike existing benchmarking of ANN search, we include the computational costs of constructing and maintaining an index structure throughout a dynamic process.

The main contributions of our work are as follows:

• We present a novel characterisation of the complexity and dynamic variations of ANN search on big high-dimensional (∼100 dimensions) dynamic datasets (Section 2). From this, we generate benchmarks that model domain-specific applications in two key categories of dynamic search problems: online data collection and online feature learning (Figure 1b).

• We establish the baseline performance of five promising ANN methods using extended hyperparameter sets to better address the requirements of dynamic search problems (Section 5). We discover that ANN methods such as ScaNN (Guo et al., 2020) and HNSW (Malkov & Yashunin, 2018) can outperform an exhaustive search despite online index overheads.

• We show that our benchmarking framework can successfully identify which ANN method is best suited to a given dynamic search problem. Our framework generates the key tradeoffs for selecting a suitable ANN method and can be extended to additional dynamic search problems and to future ANN research.
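The recall-runtime tradeoff discussed above is typically quantified by comparing an ANN method's results against exact brute-force ground truth. The following minimal sketch (our own illustration, not code from the framework; the function names are hypothetical) shows how recall@k is commonly computed in ANN benchmarking:

```python
import numpy as np

def brute_force_knn(index_vectors, queries, k):
    """Exact k-NN by exhaustive L2 search; serves as ground truth for recall."""
    # Pairwise squared distances between each query and every indexed vector.
    d2 = ((queries[:, None, :] - index_vectors[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true k nearest neighbours recovered by the ANN method."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 100))    # ~100-dimensional samples, as in our benchmarks
queries = rng.normal(size=(10, 100))
exact = brute_force_knn(data, queries, k=10)
# A perfect ANN method reproduces the exact neighbours: recall@10 == 1.0
assert recall_at_k(exact, exact) == 1.0
```

An ANN method substitutes a faster, approximate search for `brute_force_knn`; its recall is then traded against the wall-clock time of that search.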

2. PROPOSED CATEGORISATION OF DYNAMIC SEARCH PROBLEMS

In this section, we identify two key categories of dynamic search problems based on their requirements: online data collection and online feature learning. We also identify key measures that characterise specific instances of both static and dynamic search problems. ANN search with online data collection or online feature learning requires online index construction (Figure 1b). A major practical advantage of online index construction is that it allows search information to be fed back in a closed-loop fashion.

To categorise dynamic search problems, we consider their requirements for online data collection or online feature learning. Autonomous navigation, live internet services and generative learning methods are common examples of online data collection that generate an increasing number of samples over time. An increase in the number of indexed samples directly increases the computational cost of performing searches. Within many machine learning processes, the training of an embedding space is an example of online feature learning. Updating model parameters during the learning process updates the embedded representation of the indexed samples. This update can affect local and global index structures and degrade the performance of subsequent searches.

From a database perspective, we match online data collection and online feature learning with the operations of adding new samples and updating existing samples, respectively. Our benchmarks are designed around each of these two operations in order to model the range of dynamic search problems we are interested in. The removal of samples is another fundamental database operation, which is often applied heuristically to maintain sample diversity while limiting the total sample count. Removing samples can therefore be viewed as a heuristic that provides a tractability tradeoff. In this research we omit the remove operation to focus on benchmarking the baseline tractability of ANN methods alone.
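The mapping above, from online data collection to an add operation and from online feature learning to an update operation, can be sketched as a dynamic evaluation loop. This is our own simplified illustration of the interleaved structure in Figure 1b, not the framework's actual code; the index interface (`add`, `update`, `search`) is a hypothetical stand-in for whatever API a given ANN library exposes:

```python
import numpy as np

class ExhaustiveIndex:
    """Worst-case baseline: no index structure, every search is a full scan.
    Stands in for any ANN index exposing add/update/search operations."""
    def __init__(self, dim):
        self.vectors = np.empty((0, dim))

    def add(self, batch):            # online data collection: new samples arrive
        self.vectors = np.vstack([self.vectors, batch])

    def update(self, ids, batch):    # online feature learning: embeddings move
        self.vectors[ids] = batch

    def search(self, queries, k):
        d2 = ((queries[:, None, :] - self.vectors[None, :, :]) ** 2).sum(-1)
        return np.argsort(d2, axis=1)[:, :k]

def run_dynamic_benchmark(index, update_batches, query_batches, k):
    """Interleave batches of index updates with batches of searches.
    In a real benchmark the wall-clock time of both the maintenance and the
    search calls would be accumulated, so index overheads count against runtime."""
    results = []
    for batch, queries in zip(update_batches, query_batches):
        index.add(batch)                  # index maintenance is on the clock
        results.append(index.search(queries, k))
    return results

rng = np.random.default_rng(1)
idx = ExhaustiveIndex(dim=100)
updates = [rng.normal(size=(200, 100)) for _ in range(3)]
queries = [rng.normal(size=(5, 100)) for _ in range(3)]
out = run_dynamic_benchmark(idx, updates, queries, k=10)
assert all(r.shape == (5, 10) for r in out)
```

An online feature learning scenario would call `index.update` in place of `index.add` in the loop, re-embedding existing samples rather than growing the index.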



¹ Code is submitted in the supplementary materials and will be made publicly available upon publication.



Figure 1: a) Existing ANN benchmarks evaluate performance using a single batch of searches performed on a static index. b) Our framework generalises to dynamic search problems where batches of index updates occur between batches of searches. Our benchmarks provide an improved model for optimising ANN usage in online data collection and online feature learning.

