SEARCH DATA STRUCTURE LEARNING

Abstract

In our modern world, an enormous amount of data surrounds us, yet we are rarely interested in more than a handful of data points at once. It is like searching for needles in a haystack, and in many cases there is no better algorithm than a random search, which might not be viable. Previously proposed algorithms for efficient database access target particular applications, such as finding the min/max, finding all points within a range, or finding the k-nearest neighbours. Consequently, there is a lack of versatility in what can be searched in a gigantic database. In this work, we propose Search Data Structure Learning (SDSL), a generalization of the standard Search Data Structure (SDS) in which the machine must learn how to search the database. To evaluate approaches in this field, we propose a novel metric, the Sequential Search Work Ratio (SSWR), a natural way of measuring a search's efficiency and quality. Finally, we inaugurate the field with the Efficient Learnable Binary Access (ELBA), a family of models for Search Data Structure Learning. It requires a means to train two parametric functions and a search data structure for binary codes. For the training, we developed a novel loss function, the F-beta Loss. For the SDS, we describe the Multi-Bernoulli Search (MBS), a novel approach for probabilistic binary codes. Finally, we exhibit the synergy between the F-beta Loss and the MBS by showing experimentally that their combination is at least twice as effective as using the alternative loss functions of MIHash and HashNet, and twenty times as effective as using another SDS based on the Hamming radius.

1. INTRODUCTION

In many applications, machines must perform many searches in a gigantic database in which the number of relevant documents is minuscule, e.g. ten in a billion. It is like searching for needles in a haystack. In those cases, considering every document is extremely inefficient; for productivity, the search should not consider the whole database. Traditionally, this is accomplished by building a search data structure and seeking within it. These data structures can take many forms. For example, there are tree-based structures such as the B-Tree (Bayer & McCreight, 1970), the k-d tree (Friedman et al., 1977), the R-Tree (Guttman, 1984) or the M-Tree (Ciaccia et al., 1997), to name a few. In addition to trees, KNNG (Paredes & Chávez, 2005) builds a graph designed for the k-nearest neighbour search. Later approaches improve on KNNG, both in construction and search time and in search quality itself; among them are Efanna (Fu & Cai, 2016), HNSW (Malkov & Yashunin, 2018) and ONNG (Iwasaki & Miyazaki, 2018).

One of the most common types of search data structures is the hash table. It is so useful that it is implemented natively in programming languages such as Python (with the dictionary type), and it is often the main tool an application relies on for efficiency. For example, from a short and noisy song sample, Shazam (Wang et al., 2003) can retrieve the whole song by using hash tables filled with well-designed fingerprints of each song. Traditionally, each search data structure was designed for a particular type of search. Hash tables, for instance, can retrieve documents very quickly, even in gigantic databases, but the query must be equal to the key. This requirement makes the hash table not always applicable. For instance, if the database is indexed by date and time and we seek all documents from a specific day, it might not be optimal to query every second of that day with an equality search.
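The mismatch between equality search and range search can be made concrete with a small sketch. The snippet below is illustrative only (the database, keys, and document names are hypothetical, not from the paper): a Python dictionary answers an exact-key probe in constant time, but retrieving "all documents from one minute" through equality searches requires one probe per candidate second in the range.

```python
# Illustrative sketch (hypothetical data): equality lookup vs. range query
# on a hash table indexed by date and time, one document per second.
from datetime import datetime, timedelta

# Hypothetical database: 60 documents, keyed by their timestamp.
db = {datetime(2021, 3, 14, 9, 26, s): f"doc-{s}" for s in range(60)}

# Exact-key lookup: a single O(1) probe.
doc = db[datetime(2021, 3, 14, 9, 26, 35)]

# Range query ("all documents from minute 9:26") via equality searches:
# one probe per candidate key, whether or not that key is present.
start = datetime(2021, 3, 14, 9, 26, 0)
hits = []
for s in range(60):
    key = start + timedelta(seconds=s)
    if key in db:              # one equality probe per second in the range
        hits.append(db[key])

print(len(hits))  # 60 probes here; a full day would require 86,400 probes
```

Scaling the range from a minute to a day multiplies the number of probes by 1,440 even if the number of matching documents stays the same, which is exactly why a structure supporting range search is preferable in this setting.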
The B-Tree (Bayer & McCreight, 1970) was introduced precisely for applications where a range search is preferable (and where insertion must be faster than in a sorted array queried by dichotomic search). Equality and range are far from being the only

