LEARNED INDEX WITH DYNAMIC ϵ

Abstract

Index structures are a fundamental component of database systems and support a broad range of data retrieval applications. Recent learned index methods show superior performance by learning hidden yet useful data distributions with the help of machine learning, and provide a guarantee that the prediction error is no more than a pre-defined ϵ. However, existing learned index methods adopt a fixed ϵ for all the learned segments, neglecting the diverse characteristics of different data localities. In this paper, we propose a mathematically grounded learned index framework with dynamic ϵ, which is efficient and pluggable into existing learned index methods. We theoretically analyze prediction error bounds that link ϵ with data characteristics for an illustrative learned index method. Guided by the derived bounds, we learn how to vary ϵ and improve index performance with a better space-time trade-off. Experiments with real-world datasets and several state-of-the-art methods demonstrate the efficiency, effectiveness, and usability of the proposed framework.

1. INTRODUCTION

Data indexing (Graefe & Kuno, 2011; Wang et al., 2018; Luo & Carey, 2020; Zhou et al., 2020), which stores keys and corresponding payloads in designed structures, supports efficient query operations over data and benefits various data retrieval applications. Recently, Machine Learning (ML) models have been incorporated into the design of index structures, leading to substantial improvements in both storage space and querying efficiency (Kipf et al., 2019; Ferragina & Vinciguerra, 2020a; Vaidya et al., 2021). The key insight behind this trending topic of "learned index" is that the data to be indexed contain useful distribution information, and such information can be exploited by trainable ML models that map the keys {x} to their stored positions {y}.

To approximate the data distribution, state-of-the-art (SOTA) learned index methods (Galakatos et al., 2019; Kipf et al., 2020; Ferragina & Vinciguerra, 2020b; Stoian et al., 2021) propose to learn piece-wise linear segments S = [S_1, ..., S_i, ..., S_N], where S_i : y = a_i x + b_i is the linear segment parameterized by (a_i, b_i) and N is the total number of learned segments. These methods introduce an important pre-defined parameter ϵ ∈ Z_{>1} to guarantee the worst-case preciseness: |S_i(x) − y| ≤ ϵ for i ∈ [N]. By tuning ϵ, various space-time preferences from users can be met. For example, a relatively large ϵ results in a small index size at the cost of large prediction errors, while a relatively small ϵ provides small prediction errors at the cost of more learned segments and thus a larger index size.

Existing learned index methods implicitly assume that the whole dataset to be indexed exhibits the same characteristics across different localities and thus adopt the same ϵ for all the learned segments. However, the scenario where the local data distribution varies is very common, leading to sub-optimal index performance of existing methods.
For example, the real-world Weblog dataset used in our experiments exhibits markedly non-linear temporal patterns caused by online campus activity such as class-schedule arrangements, weekends, and holidays. More importantly, the impact of ϵ on index performance is intrinsically linked to data characteristics, a connection that existing learned index methods do not fully explore or exploit. Motivated by these observations, in this paper we theoretically analyze the impact of ϵ on index performance, and link the characteristics of data localities with the dynamic adjustment of ϵ. Based on the

