LEARNED INDEX WITH DYNAMIC ϵ

Abstract

Index structures are a fundamental component of database systems and facilitate broad data retrieval applications. Recent learned index methods show superior performance by learning hidden yet useful data distributions with the help of machine learning, and provide a guarantee that the prediction error is no more than a pre-defined ϵ. However, existing learned index methods adopt a fixed ϵ for all the learned segments, neglecting the diverse characteristics of different data localities. In this paper, we propose a mathematically-grounded learned index framework with dynamic ϵ, which is efficient and pluggable into existing learned index methods. We theoretically analyze prediction error bounds that link ϵ with data characteristics for an illustrative learned index method. Under the guidance of the derived bounds, we learn how to vary ϵ and improve the index performance with a better space-time trade-off. Experiments with real-world datasets and several state-of-the-art methods demonstrate the efficiency, effectiveness and usability of the proposed framework.

1. INTRODUCTION

Data indexing (Graefe & Kuno, 2011; Wang et al., 2018; Luo & Carey, 2020; Zhou et al., 2020), which stores keys and corresponding payloads in designed structures, supports efficient query operations over data and benefits various data retrieval applications. Recently, Machine Learning (ML) models have been incorporated into the design of index structures, leading to substantial improvements in both storage space and querying efficiency (Kipf et al., 2019; Ferragina & Vinciguerra, 2020a; Vaidya et al., 2021). The key insight behind this trending topic of "learned index" is that the data to be indexed contain useful distribution information, and such information can be exploited by trainable ML models that map the keys {x} to their stored positions {y}. To approximate the data distribution, state-of-the-art (SOTA) learned index methods (Galakatos et al., 2019; Kipf et al., 2020; Ferragina & Vinciguerra, 2020b; Stoian et al., 2021) learn piece-wise linear segments S = [S_1, ..., S_i, ..., S_N], where S_i: y = a_i x + b_i is a linear segment parameterized by (a_i, b_i) and N is the total number of learned segments. These methods introduce an important pre-defined parameter ϵ ∈ Z_{>1} to guarantee worst-case preciseness: |S_i(x) - y| ≤ ϵ for all i ∈ [N]. By tuning ϵ, various space-time preferences of users can be met: a relatively large ϵ results in a small index size at the cost of large prediction errors, whereas a relatively small ϵ provides users with small prediction errors while producing more learned segments and thus a larger index size. Existing learned index methods implicitly assume that the whole dataset to be indexed exhibits the same characteristics across different localities, and thus adopt the same ϵ for all the learned segments. However, scenarios where the local data distribution varies are very common, leading to sub-optimal index performance of existing methods.
For example, the real-world Weblog dataset used in our experiments typically exhibits non-linear temporal patterns caused by online campus transactions such as class schedule arrangements, weekends, and holidays. More importantly, the impact of ϵ on index performance is intrinsically linked to data characteristics, which are not fully explored and utilized by existing learned index methods. Motivated by these observations, in this paper we theoretically analyze the impact of ϵ on index performance, and link the characteristics of data localities with the dynamic adjustment of ϵ. Based on the derived theoretical results, we propose an efficient and pluggable learned index framework that dynamically adjusts ϵ in a principled way. Specifically, under the setting of an illustrative learned index method, MET (Ferragina et al., 2020), we present novel analyses of the per-segment prediction error bounds that link ϵ with the mean and variance of data localities. The segment-wise prediction error embeds the space-time trade-off, as it is the product of the number of covered keys and the mean absolute error, which determine the index size and preciseness respectively. The derived mathematical relationships enable our framework to fully explore diverse data localities with an ϵ-learner module, which learns to predict the impact of ϵ on index performance and adaptively chooses a suitable ϵ to achieve a better space-time trade-off. We apply the proposed framework to several SOTA learned index methods and conduct a series of experiments on three widely adopted real-world datasets. Compared with the original learned index methods with fixed ϵ, our dynamic-ϵ versions achieve significant index performance improvements with better space-time trade-offs. We also conduct various experiments to verify the necessity and effectiveness of the proposed framework, and provide both an ablation study and a case study to understand how the proposed framework works.
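The space-time trade-off discussed above can be illustrated with a toy greedy ϵ-bounded segmentation. The sketch below uses a "shrinking-cone" fit in the spirit of FITing-Tree rather than the method analyzed in this paper, and all names are illustrative assumptions; counting the segments produced for different values of ϵ shows how a smaller ϵ inflates the index size.

```python
def greedy_segments(keys, eps):
    """Greedy eps-bounded piece-wise linear fit via a shrinking slope cone.

    Each segment starts at (x0, y0) and maintains a feasible slope interval
    [lo, hi]: any slope in it predicts every covered position within +/- eps.
    When the interval becomes empty, a new segment starts.
    Simplified sketch, not the implementation of any published system.
    """
    n = len(keys)
    segments = 0
    i = 0
    while i < n:
        x0, y0 = keys[i], i
        lo, hi = float("-inf"), float("inf")
        j = i + 1
        while j < n:
            dx = keys[j] - x0  # keys are sorted and distinct, so dx > 0
            # require |slope * dx + y0 - j| <= eps
            lo = max(lo, (j - eps - y0) / dx)
            hi = min(hi, (j + eps - y0) / dx)
            if lo > hi:
                break  # no single slope fits all covered points
            j += 1
        segments += 1
        i = j
    return segments

# Perfectly linear data needs one segment regardless of eps,
# while curved data trades segments (space) against eps (search time).
assert greedy_segments(list(range(100)), 1) == 1
curved = [x * x for x in range(1, 200)]
assert greedy_segments(curved, 1) > greedy_segments(curved, 64)
```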
Our contributions can be summarized as follows:
• We make the first step to exploit the potential of dynamically adjusting ϵ for learned indexes, and propose an efficient and pluggable framework that can be applied to a broad class of piece-wise approximation algorithms.
• We provide theoretical analysis for a proxy task modeling the index space-time trade-off, which establishes our ϵ-learner based on the data characteristics and the derived bounds.
• We achieve significant index performance improvements over several SOTA learned index methods on real-world datasets. To facilitate further studies, we make our codes and datasets public at https://github.com/yxdyc/Learned-Index-Dynamic-Epsilon.

2. BACKGROUND

Learned Index. Given a dataset D = {(x, y) | x ∈ X, y ∈ Y}, X is the set of keys over a universe U such as reals or integers, and Y is the set of positions where the keys and corresponding payloads are stored. An index such as the B+-tree (Abel, 1984) aims to build a compact structure that supports efficient query operations over D. Typically, the keys are assumed to be sorted in ascending order to satisfy the key-position monotonicity, i.e., for any two keys, x_i > x_j iff their positions y_i > y_j, so that range queries (X ∩ [x_low, x_high]) can be handled. Recently, learned index methods (Kraska et al., 2018; Li et al., 2019; Tang et al., 2020; Dai et al., 2020; Crotty, 2021) leverage ML models to mine useful distribution information from D, and incorporate such information to boost index performance. To look up a given key x, the learned index first predicts a position ŷ using the learned models, and subsequently finds the stored true position y based on ŷ with a binary search or exponential search. By modeling the data distribution, learned indexes achieve faster query speed and much smaller storage cost than the traditional B+-tree index, with different optimization aspects such as ϵ-bounded linear approximation (Galakatos et al., 2019; Ferragina & Vinciguerra, 2020b; Kipf et al., 2020; Marcus et al., 2020; Li et al., 2021b) and data layout (Ding et al., 2020; Wu et al., 2021; Zhang & Gao, 2022; Wu et al., 2022; Li et al., 2021a).

ϵ-bounded Linear Approximation. Many existing learned index methods adopt piece-wise linear segments to approximate the distribution of D due to their effectiveness and low computing cost, and introduce the parameter ϵ to provide a worst-case preciseness guarantee and a tunable knob to meet various space-time trade-off preferences.
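The lookup procedure described above can be sketched as follows. The linear model and function names are illustrative assumptions rather than an existing system's API; the point is that, thanks to the ϵ-guarantee, the final search only needs to inspect the window [ŷ - ϵ, ŷ + ϵ].

```python
import bisect

def predict(a, b, x):
    """Linear segment model y = a*x + b, rounded to an integer position."""
    return round(a * x + b)

def lookup(keys, a, b, eps, x):
    """Find the position of key x in the sorted array `keys`.

    The eps-bound |predicted - true| <= eps lets us binary-search only
    inside a window of 2*eps + 1 positions around the prediction,
    so the search cost is O(log eps) instead of O(log n).
    """
    y_hat = predict(a, b, x)
    lo = max(0, y_hat - eps)
    hi = min(len(keys), y_hat + eps + 1)
    i = bisect.bisect_left(keys, x, lo, hi)
    if i < hi and keys[i] == x:
        return i
    return -1  # key not present

keys = [2 * i for i in range(100)]  # sorted keys: 0, 2, 4, ..., 198
a, b = 0.5, 0.0                     # here the model is exact: position = key / 2
assert lookup(keys, a, b, 4, 40) == 20
assert lookup(keys, a, b, 4, 41) == -1  # absent key
```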
Here we briefly introduce the SOTA ϵ-bounded learned index methods most closely related to our work, and refer readers to the literature (Ferragina & Vinciguerra, 2020a; Marcus et al., 2020; Stoian et al., 2021) for details of other methods. We first describe an illustrative learned index algorithm, MET (Ferragina et al., 2020). Specifically, for any two consecutive keys of D, suppose their key interval (x_i - x_{i-1}) is drawn according to a random process {G_i}_{i∈N}, where G_i is a positive independent and identically distributed (i.i.d.) random variable with mean µ and variance σ^2. MET learns linear segments {S_i: y = a_i x + b_i} via a simple deterministic strategy: the current segment fixes the slope a_i = 1/µ and goes through the first available data point, which determines b_i. Then S_i covers the remaining data points one by one until a data point (x′, y′) incurs a prediction error larger than ϵ. The violation triggers a new linear segment that begins from (x′, y′), and the process repeats until D has been traversed.
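A minimal sketch of this MET-style strategy, simplified from the description above and not the authors' implementation, might look like the following; each segment has the fixed slope 1/µ and starts at the first point it covers.

```python
import statistics

def met_segments(keys, eps):
    """MET-style greedy segmentation (illustrative sketch).

    Every segment has slope a = 1/mu, where mu is the mean key gap, and
    passes through its first data point (so the intercept is determined).
    A new segment starts when the prediction error would exceed eps.
    Returns a list of (start_position, slope) pairs.
    """
    gaps = [keys[i] - keys[i - 1] for i in range(1, len(keys))]
    mu = statistics.mean(gaps)
    a = 1.0 / mu
    segments = []
    start = 0  # index (= position) of the current segment's first point
    for j in range(1, len(keys)):
        # segment prediction: y = a * (x - x_start) + y_start
        y_pred = a * (keys[j] - keys[start]) + start
        if abs(y_pred - j) > eps:
            segments.append((start, a))
            start = j  # violation: open a new segment at (x', y')
    segments.append((start, a))
    return segments

# Uniform gaps: slope 1/mu is exact, so a single segment suffices.
assert len(met_segments(list(range(0, 200, 2)), 1)) == 1
# A large jump in the gaps forces extra segments for small eps.
bursty = [0, 1, 2, 3, 100, 101, 102, 103]
assert len(met_segments(bursty, 1)) > len(met_segments(bursty, 50))
```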



* The first two authors contributed equally to this work. † Corresponding author. ‡ Work was done at Alibaba.

