BENCHMARKING APPROXIMATE K-NEAREST NEIGHBOUR SEARCH FOR BIG HIGH-DIMENSIONAL DYNAMIC DATA

Abstract

Approximate k-Nearest Neighbour (ANN) methods are commonly used for mining information from big high-dimensional datasets. For each application, the high-level dataset properties and run-time requirements determine which method provides the most suitable tradeoffs. However, due to a significant lack of comprehensive benchmarking, judicious method selection is not currently possible for ANN applications that involve frequent online changes to datasets. Here we address this issue by building upon existing benchmarks for static search problems to provide a new benchmarking framework for big high-dimensional dynamic data. We apply our framework to dynamic scenarios modelled after common real-world applications. In all cases we are able to identify a suitable recall-runtime tradeoff that improves upon a worst-case exhaustive search. Our framework provides a flexible solution to accelerate future ANN research and to enable researchers in other online data-rich domains to find suitable methods for handling their ANN searches.¹

1. INTRODUCTION

Approximate k-Nearest Neighbour (ANN) search is a widely applicable technique for tractably computing local statistics over large datasets of high-dimensional discrete samples (Beyer et al., 1999). ANN methods achieve sub-linear search times by trading off search accuracy against runtime, and are applied in many domains such as image retrieval, robotic localisation, cross-modal search and other semantic searches (Prokhorenkova & Shekhovtsov, 2020). ANN search is well suited to applications where an index structure can be precomputed over a static dataset and then provides a suitable recall-runtime tradeoff. Achieving an optimal tradeoff for specific dataset properties and application requirements relies on hyperparameter tuning for each candidate ANN method. For instance, graph-based indexes can be tuned to achieve high search accuracy, while quantisation methods are better suited to faster searches with less accurate results. In practice, when given a new dataset, there is a significant computational cost to evaluating and selecting the best-performing ANN method. Several ANN benchmarks have been established to guide the selection and parameter tuning required to achieve tractable searches (Aumüller et al., 2017; Matsui, 2020).

However, current ANN benchmarks focus on static search problems and cannot indicate whether any ANN methods are suitable for dynamic search problems, where the indexed dataset changes over time. We observe that current ANN benchmarks do not generalise to dynamic search problems because they perform index construction as an offline process that optimises for search performance on a fixed dataset (Figure 1a). This fails to address the requirements of growing fields such as Machine Learning (ML), where there is a strong need for tractable k-nearest neighbour search on large dynamic sets of high-dimensional samples (Prokhorenkova & Shekhovtsov, 2020).
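As a minimal illustration of the recall-runtime tradeoff discussed above (not one of the benchmarked methods), the sketch below contrasts an exhaustive k-NN scan with a toy random-hyperplane (LSH-style) index that ranks only candidates sharing the query's hash bucket; all function names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)


def exact_knn(data, query, k):
    """Brute-force k-NN: exact, but O(n*d) work per query."""
    dists = np.linalg.norm(data - query, axis=1)
    return np.argsort(dists)[:k]


def lsh_knn(data, query, k, n_planes=8):
    """Approximate k-NN via a single random-hyperplane hash:
    only points whose binary code matches the query's code are
    ranked, trading recall for a much smaller candidate set."""
    planes = rng.standard_normal((n_planes, data.shape[1]))
    codes = data @ planes.T > 0          # one n_planes-bit code per point
    qcode = planes @ query > 0
    candidates = np.where((codes == qcode).all(axis=1))[0]
    if len(candidates) == 0:
        candidates = np.arange(len(data))  # empty bucket: full scan
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]


data = rng.standard_normal((5000, 64))
query = rng.standard_normal(64)
exact = exact_knn(data, query, 10)
approx = lsh_knn(data, query, 10)
recall = len(set(exact) & set(approx)) / 10
print(f"recall@10 = {recall:.2f}")
```

Raising `n_planes` shrinks the candidate buckets, which speeds up queries but lowers recall; this is the same kind of hyperparameter tuning a benchmark must perform per method.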
For example, local statistics can be extracted by computing neighbourhoods in the embedding space of a learning process, but computing these neighbourhoods requires frequent evaluation of sample locations in the highly dynamic embedding. Due to the lack of suitable ANN benchmarks, achieving tractable

¹ Code submitted in supplementary materials and will be available publicly on publication.
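The worst-case exhaustive search that dynamic ANN methods must beat can be sketched as a structure-free dynamic index: inserts and deletes are trivial because no index structure is maintained, but every query pays a full linear scan. This is an illustrative baseline under assumed semantics, not code from the benchmark; the class and method names are hypothetical:

```python
import numpy as np


class DynamicExactIndex:
    """Worst-case dynamic k-NN baseline: O(1) updates, O(n*d) queries."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}   # id -> vector; deletion just drops the entry
        self.next_id = 0

    def insert(self, vec):
        vid = self.next_id
        self.vectors[vid] = np.asarray(vec, dtype=float)
        self.next_id += 1
        return vid

    def delete(self, vid):
        self.vectors.pop(vid, None)

    def query(self, vec, k):
        """Exhaustive scan over all currently stored vectors."""
        ids = list(self.vectors)
        mat = np.stack([self.vectors[i] for i in ids])
        dists = np.linalg.norm(mat - np.asarray(vec, dtype=float), axis=1)
        return [ids[i] for i in np.argsort(dists)[:k]]


idx = DynamicExactIndex(dim=2)
a = idx.insert([0.0, 0.0])
b = idx.insert([3.0, 4.0])
print(idx.query([0.5, 0.5], k=1))   # → [0], i.e. the id of the closer point
```

Any ANN method evaluated in a dynamic scenario must amortise its index-maintenance cost well enough to beat this baseline's query time at an acceptable recall.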

