LEARNED INDEX WITH DYNAMIC ϵ

Abstract

The index structure is a fundamental component of databases and facilitates a broad range of data retrieval applications. Recent learned index methods show superior performance by learning hidden yet useful data distributions with the help of machine learning, and guarantee that the prediction error is no more than a pre-defined ϵ. However, existing learned index methods adopt a fixed ϵ for all the learned segments, neglecting the diverse characteristics of different data localities. In this paper, we propose a mathematically grounded learned index framework with dynamic ϵ, which is efficient and pluggable into existing learned index methods. We theoretically analyze prediction error bounds that link ϵ with data characteristics for an illustrative learned index method. Under the guidance of the derived bounds, we learn how to vary ϵ and improve the index performance with a better space-time trade-off. Experiments with real-world datasets and several state-of-the-art methods demonstrate the efficiency, effectiveness, and usability of the proposed framework.



Other ϵ-bounded learned index methods, such as FITing-Tree (Galakatos et al., 2019), PGM (Ferragina & Vinciguerra, 2020b), and Radix-Spline (Kipf et al., 2020), learn linear segments in a similar manner to MET but use different mechanisms to determine the parameters of $\{S_i\}$. However, they all constrain every learned segment with the same ϵ. In this paper, we study how to enhance existing learned index methods from a new perspective: dynamic adjustment of ϵ that accounts for the diversity of different data localities. We also present new theoretical results about the effect of ϵ. In Appx. A, we provide a more detailed description of, and comparison to, existing learned index methods.

3 LEARN TO VARY ϵ

3.1. PROBLEM FORMULATION AND MOTIVATION

Before introducing the proposed framework, we first formulate the task of learning an index from data with an ϵ guarantee, and discuss why we need to vary ϵ.

Given a dataset $D$ to be indexed and an ϵ-bounded learned index algorithm $A$, we aim to learn linear segments $S = [S_1, \ldots, S_i, \ldots, S_N]$ with segment-wise varied $[\epsilon_i]_{i \in [N]}$, such that a better trade-off between storage cost (size in KB) and query efficiency (time in ns) is achieved than with a fixed ϵ. Let $D_i \subset D$ be the data whose keys are covered by $S_i$. For the remaining data $D \setminus \bigcup_{j<i} D_j$, the algorithm $A$ repeatedly checks whether the prediction error of a new data point violates the given $\epsilon_i$ and outputs the learned segment $S_i$. When all the $\epsilon_i$ for $i \in [N]$ take the same value, the problem reduces to the one that existing learned index methods address.

To facilitate theoretical analysis, we focus on two proxy quantities for the target space-time trade-off: (1) the number of learned segments $N$ and (2) the mean absolute prediction error $\mathrm{MAE}(D_i|S_i)$, which is affected and upper-bounded by $\epsilon_i$. We note that improvements in the N-MAE trade-off adequately reflect improvements in the space-time trade-off: (1) The size of the learned segments in bytes is positively correlated with $N$ and differs from it only by a constant factor; e.g., a segment takes 128 bits if it consists of two double-precision float parameters (slope and intercept). (2) When using exponential search, the querying complexity is $O(\log(N) + \log(\mathrm{MAE}(D_i|S_i)))$, in which the first term corresponds to finding the specific segment $S'$ that covers the key $x$ of a queried data point $(x, y)$, and the second term corresponds to the search range $|\hat{y} - y|$ for the true position $y$ based on the estimated one $\hat{y} = S'(x)$.

In this paper, we adopt exponential search as the search algorithm since it is better than binary search at exploiting the predictive ability of learned models. In Appx. B, we show that the search range of exponential search is $O(\mathrm{MAE}(D_i|S_i))$, which can be much smaller than that of binary search, $O(\epsilon_i)$, especially for strong predictive models and datasets with clear linearity. Similar empirical support can also be found in (Ding et al., 2020).

Now let's examine how the parameter ϵ affects the N-MAE trade-off. These two performance terms compete with each other, and ϵ plays an important role in balancing them. If we adopt a small ϵ, the prediction error constraint is violated more frequently, leading to a large $N$; meanwhile, the preciseness of the learned index improves, leading to a small MAE over the whole data, $\mathrm{MAE}(D|S)$. On the other hand, with a large ϵ, we get a more compact learned index (i.e., a small $N$) with larger prediction errors (i.e., a large $\mathrm{MAE}(D|S)$).

The effect of ϵ on index performance is intrinsically linked to the characteristics of the data to be indexed. For real-world datasets, an important observation is that the linearity degree varies across data localities. Recall that we use piece-wise linear segments to fit the data, and ϵ determines the partition and the fitness of the segments. By varying ϵ, we can adapt to the local variations of $D$ and adjust the partition such that each learned segment fits the data better. Formally, consider the quantity $\mathrm{SegErr}_i$, defined as the total prediction error within a segment $S_i$, i.e., $\mathrm{SegErr}_i \triangleq \sum_{(x,y) \in D_i} |y - S_i(x)|$, which is also the product of the number of covered keys $\mathrm{Len}(D_i)$ and the mean absolute error $\mathrm{MAE}(D_i|S_i)$. Note that a large $\mathrm{Len}(D_i)$ leads to a small $N$ since $|D| = \sum_{i=1}^{N} \mathrm{Len}(D_i)$. From this view, the quantity $\mathrm{SegErr}_i$ internally reflects the N-MAE trade-off. Later we will show how to leverage this quantity to dynamically adjust ϵ.
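The identity SegErr_i = Len(D_i) · MAE(D_i|S_i) used above can be checked on a toy segment. The following is a minimal sketch; the function name `seg_err` and the four data points are illustrative assumptions, not part of the original method or its datasets.

```python
from typing import Sequence, Tuple


def seg_err(points: Sequence[Tuple[float, int]], slope: float,
            intercept: float) -> Tuple[float, int, float]:
    """Return (SegErr_i, Len(D_i), MAE(D_i|S_i)) for the segment y = slope*x + intercept.

    `points` holds (key, position) pairs covered by the segment.
    """
    errors = [abs(y - (slope * x + intercept)) for x, y in points]
    length = len(errors)
    total = sum(errors)
    return total, length, total / length


# Hypothetical toy data: four keys with positions 0..3.
points = [(10.0, 0), (20.0, 1), (31.0, 2), (39.0, 3)]
total, length, mae = seg_err(points, slope=0.1, intercept=-1.0)
print(total, length, mae)
```

A longer segment with the same MAE has a larger SegErr_i but reduces N, which is why this single quantity captures both sides of the trade-off.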

3.2. OVERALL FRAMEWORK

In practice, it is intractable to directly solve the problem formulated in Section 3.1. With a given $\epsilon_i$, the one-pass algorithm $A$ determines $S_i$ and $D_i$ until the error bound $\epsilon_i$ is violated. In other words, it is unknown a priori what the data partition $\{D_i\}$ will be, which makes it impossible to solve the problem by searching among all possible $\{\epsilon_i\}$ and learning an index with each given set of $\{\epsilon_i\}$. In this paper, we investigate how to efficiently find an approximate solution to this problem via the introduced ϵ-learner module. Instead of heuristically adjusting ϵ, the ϵ-learner learns to predict the impact of ϵ on the index structure and adaptively adjusts ϵ in a principled way. Meanwhile, introducing the ϵ-learner should not sacrifice the efficiency of the original one-pass learned index algorithms, which is important for real-world applications. These two design considerations establish our dynamic ϵ framework, as shown in Figure 1.

[Figure 1: The dynamic ϵ framework. Recoverable caption text: (3) the ϵ-learner predicts a suitable $\epsilon_i$ accordingly, and (4) we learn a new segment $S_i$ using $A$ (e.g., PGM) with $\epsilon_i$. (5) Once $S_i$ triggers the violation of $\epsilon_i$, the ϵ-learner is updated and enhanced with the rewarded ground-truth. Steps (2) to (5) repeat in an online manner to approximate the distribution of $D$.]

The ϵ-learner is based on an estimation function $\mathrm{SegErr} = f(\epsilon, \mu, \sigma)$ that captures the mathematical relationships among ϵ, $\mathrm{SegErr}_i$, and the characteristics µ, σ of the data to be indexed. As a start, users provide an expected ε that expresses their preference in space-sensitive or time-sensitive applications. To meet the user requirements, we then internally transform ε into another proxy quantity SegErr, which reflects the expected prediction error for each segment if we set $\epsilon_i = ε$. This transformation also links the adjustment of ϵ to the data characteristics, which enables the data-dependent adjustment of ϵ.
Beginning with ε, the ϵ-learner chooses a suitable $\epsilon_i$ according to the current data characteristics, then learns a segment $S_i$ using $A$, and finally enhances the ϵ-learner with the rewarded ground-truth $\mathrm{SegErr}_i$ of each segment. To make the introduced adjustment efficient, we propose to sample only a small look-ahead set $D'$ to estimate the characteristics (µ, σ) of the following data locality. The learning process repeats and remains an efficient one-pass procedure.

Note that the proposed framework provides users with the same interface as the original learned index methods. That is, we add no cost to the users' experience, and users can use our framework with a given ε just as they use the original methods with a given ϵ. The ϵ is an intuitive, meaningful, easy-to-set, and method-agnostic quantity for users. On the one hand, we can easily restrict the worst-case querying cost with ϵ, since the number of data accesses in the querying process is $O(\log(|\hat{y} - y|))$. On the other hand, ϵ is easier to estimate than other quantities such as index size and querying time, which depend on specific algorithms, data layouts, implementations, and experimental platforms. Our pluggable framework retains the benefits of existing learned index methods, such as the aforementioned usability of ϵ and the ability to handle dynamic updates and hard size requirements.

We have seen how ϵ determines index performance and how $\mathrm{SegErr}_i$ embeds the N-MAE trade-off in Section 3.1. In Section 3.3, we further theoretically analyze the relationship among ϵ, $\mathrm{SegErr}_i$, and the data characteristics µ, σ at different localities. Based on the analysis, we elaborate on the details of the ϵ-learner and the internal transformation between ϵ and $\mathrm{SegErr}_i$ in Section 3.4.
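The one-pass loop described in this section, in which an $\epsilon_i$ is chosen before each segment is learned, can be sketched as follows. This is only an illustration under stated assumptions: `segment_one_pass` and the `choose_eps` callback are hypothetical names, and the inner segment learner (a line through the first two points of each segment) is a deliberately simple stand-in for algorithm $A$, not MET, PGM, or FITing-Tree.

```python
from typing import Callable, List, Sequence, Tuple

# (start, end, slope, intercept, eps): positions [start, end) are covered.
Segment = Tuple[int, int, float, float, float]


def segment_one_pass(keys: Sequence[float],
                     choose_eps: Callable[[int], float]) -> List[Segment]:
    """Greedy one-pass segmentation with a per-segment error bound.

    `choose_eps(start)` plays the role of the epsilon-learner: it returns the
    bound for the segment that begins at position `start`.
    """
    segments: List[Segment] = []
    n = len(keys)
    start = 0
    while start < n:
        eps = choose_eps(start)
        # Simplified stand-in for A: slope taken from the first two points.
        if start + 1 < n and keys[start + 1] > keys[start]:
            slope = 1.0 / (keys[start + 1] - keys[start])
        else:
            slope = 0.0
        intercept = start - slope * keys[start]
        end = start + 1
        # Extend the segment until the next point would violate eps.
        while end < n and abs(end - (slope * keys[end] + intercept)) <= eps:
            end += 1
        segments.append((start, end, slope, intercept, eps))
        start = end
    return segments


# Hypothetical toy keys; a constant callback reproduces the fixed-eps case.
keys = [0, 1, 2, 3, 10, 20, 30, 31, 32, 33]
segments = segment_one_pass(keys, lambda start: 1.0)
print([(s, e) for s, e, *_ in segments])
```

Passing a callback that inspects the upcoming keys, instead of a constant, is where a data-dependent choice of $\epsilon_i$ would plug in without changing the single pass over the data.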

3.3. PREDICTION ERROR ESTIMATION

In this section, we theoretically study the impact of ϵ on the prediction error $\mathrm{SegErr}_i$ of each learned segment $S_i$. The derived closed-form relationships are taken into account in the design of the proposed ϵ-learner module (Section 3.4). Specifically, for the MET algorithm, we can prove the following theorem, which bounds the expectation of $\mathrm{SegErr}_i$ in terms of ϵ and the key interval distribution of $D$.

Theorem 1. Given a dataset $D$ to be indexed and an $\epsilon \in \mathbb{Z}_{>1}$, consider the setting of the MET algorithm (Ferragina et al., 2020), in which the key intervals of $D$ are drawn from a random process consisting of positive i.i.d. random variables with mean µ and variance $\sigma^2$, and $\epsilon \gg \sigma/\mu$. For a learned segment $S_i$ and its covered data $D_i$, denote $\mathrm{SegErr}_i = \sum_{(x,y) \in D_i} |y - S_i(x)|$. Then the expectation of $\mathrm{SegErr}_i$ satisfies:

$$\frac{1}{\sqrt{\pi}} \frac{\mu}{\sigma} \epsilon^2 < E[\mathrm{SegErr}_i] < \frac{2}{3} \sqrt{\frac{2}{\pi}} \left(\frac{5}{3}\right)^{\frac{3}{4}} \left(\frac{\mu}{\sigma}\right)^2 \epsilon^3.$$

Note that the average length of segments obtained by the MET algorithm is $(\mu/\sigma)^2 \epsilon^2$. The constant $\frac{2}{3}\sqrt{\frac{2}{\pi}}\left(\frac{5}{3}\right)^{\frac{3}{4}} \approx 0.78$ in the upper bound is therefore tighter than the trivial constant 1, which corresponds to the case where every data point reaches the largest error ϵ. This theorem reveals that the prediction error $\mathrm{SegErr}_i$ depends on both ϵ and the data characteristics (µ, σ). Recall that CV = σ/µ is the coefficient of variation, a classical statistical measure of the relative dispersion of data points. In the context of linear approximation, the data statistic 1/CV = µ/σ in our bounds intrinsically corresponds to the linearity degree of the data. When µ/σ is large, the data is easy to fit with linear segments, and thus we can choose a small ϵ to achieve precise predictions. On the other hand, when µ/σ is small, it becomes harder to fit the data with a linear segment, and thus ϵ should be increased to absorb some non-linear data localities.
In this way, we can make the total prediction error of different learned segments consistent and achieve a better N-MAE trade-off. This analysis also confirms the motivation for varying ϵ: the local linearity degrees of the indexed data can be diverse, and we should adjust ϵ according to the local characteristics of the data, such that the learned index can fit and leverage the data distribution better.

In the rest of this section, we provide a proof sketch of this theorem due to the space limitation. For the detailed proof, please refer to Appx. E.

• TRANSFORMED RANDOM WALK. The main idea is to model the learning process of linear approximation with an ϵ guarantee as a random walk process, and to consider that the absolute prediction error of each data point follows a folded normal distribution. Specifically, given a learned segment $S_i: y = a_i x + b_i$, we can calculate the expectation of $\mathrm{SegErr}_i$ for this segment as:

$$E[\mathrm{SegErr}_i] = a_i E\left[\sum_{j=0}^{j^*-1} |Z_j|\right] = a_i \sum_{n=1}^{\infty} E\left[\sum_{j=0}^{n-1} |Z_j|\right] \Pr(j^* = n), \tag{1}$$

where $Z_j$ is the $j$-th position of a transformed random walk $\{Z_j\}_{j \in \mathbb{N}}$, $j^* = \max\{j \in \mathbb{N} \mid -\epsilon/a_i \le Z_j \le \epsilon/a_i\}$ is the random variable indicating the maximal position at which the random walk is within the strip of boundary $\pm\epsilon/a_i$, and the last equality is due to the definition of expectation.

• PROOF OF UPPER BOUND. Under the MET setting where $a_i = 1/\mu$ and $\epsilon \gg \sigma/\mu$, the increments of the transformed random walk $\{Z_j\}$ have zero mean and variance $\sigma^2$, and many steps are necessary to reach the random walk boundary. With the Central Limit Theorem, we assume that $Z_j$ follows a normal distribution with mean $\mu_{z_j} = 0$ and variance $\sigma^2_{z_j} = j\sigma^2$, and thus $|Z_j|$ follows the folded normal distribution with expectation $E[|Z_j|] = \sqrt{2/\pi}\,\sigma\sqrt{j}$. Thus Eq. (1) becomes:

$$\frac{1}{\mu} \sum_{n=1}^{\infty} E\left[\sum_{j=0}^{n-1} |Z_j|\right] \Pr(j^* = n) < \frac{1}{\mu} \sum_{n=1}^{\infty} \sum_{j=0}^{n-1} E[|Z_j|] \Pr(j^* = n) = \frac{\sigma}{\mu} \sqrt{\frac{2}{\pi}} \sum_{n=1}^{\infty} \sum_{j=0}^{n-1} \sqrt{j}\, \Pr(j^* = n).$$

Using $E[j^*] = \frac{\mu^2}{\sigma^2}\epsilon^2$ and $\mathrm{Var}[j^*] = \frac{2}{3}\frac{\mu^4}{\sigma^4}\epsilon^4$ as derived in Ferragina et al. (2020), we get $E[(j^*)^2] = \mathrm{Var}[j^*] + (E[j^*])^2 = \frac{5}{3}\frac{\mu^4}{\sigma^4}\epsilon^4$. With the inequality $\sum_{j=0}^{n-1} \sqrt{j} < \frac{2}{3} n\sqrt{n}$ and $E[X^{\frac{3}{4}}] \le (E[X])^{\frac{3}{4}}$, we get the upper bound:

$$E[\mathrm{SegErr}_i] < \frac{2}{3}\sqrt{\frac{2}{\pi}} \frac{\sigma}{\mu} E\left[(j^*)^{\frac{3}{2}}\right] \le \frac{2}{3}\sqrt{\frac{2}{\pi}} \frac{\sigma}{\mu} \left(E\left[(j^*)^2\right]\right)^{\frac{3}{4}} = \frac{2}{3}\sqrt{\frac{2}{\pi}} \left(\frac{5}{3}\right)^{\frac{3}{4}} \left(\frac{\mu}{\sigma}\right)^2 \epsilon^3.$$

• PROOF OF LOWER BOUND. Applying the triangle inequality to Eq. (1), we get $E[\mathrm{SegErr}_i] > \frac{1}{\mu} \sum_{n=1}^{\infty} E[|Z|] \Pr(j^* = n)$, where $Z = \sum_{j=0}^{n-1} Z_j$, and $Z$ follows the normal distribution since $Z_j \sim N(0, \sigma^2_{z_j})$. We can prove that $|Z|$ follows the folded normal distribution whose expectation satisfies $E[|Z|] > \sigma(n-1)/\sqrt{\pi}$. Thus the lower bound is:

$$E[\mathrm{SegErr}_i] > \frac{\sigma}{\mu} \frac{1}{\sqrt{\pi}} \sum_{n=1}^{\infty} (n-1) \Pr(j^* = n) = \frac{\sigma}{\mu} \frac{1}{\sqrt{\pi}} E[j^* - 1] = \frac{1}{\sqrt{\pi}} \left(\frac{\mu}{\sigma}\epsilon^2 - \frac{\sigma}{\mu}\right).$$

Since $\epsilon \gg \sigma/\mu$, we can omit the right term $\frac{1}{\sqrt{\pi}} \cdot \frac{\sigma}{\mu}$ and finish the proof.

Although the derivations are based on the MET algorithm, whose slope is the reciprocal of µ, we found that the mathematical forms relating ϵ, µ/σ, and $\mathrm{SegErr}_i$ are still applicable to other ϵ-bounded methods, and we further prove in Appx. F that the learned segment slopes of these methods are similar, with bounded differences.
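To make the bounds of Theorem 1 concrete, the sketch below evaluates them numerically and checks the constant of about 0.78 against the trivial bound of ϵ times the expected segment length. The function names and the example values µ = 10, σ = 5, ϵ = 32 are illustrative assumptions, not values taken from the experiments.

```python
import math
from typing import Tuple

# Constant of the upper bound in Theorem 1: (2/3) * sqrt(2/pi) * (5/3)^(3/4).
UPPER_CONST = (2.0 / 3.0) * math.sqrt(2.0 / math.pi) * (5.0 / 3.0) ** 0.75


def expected_segment_length(mu: float, sigma: float, eps: float) -> float:
    """Average MET segment length, (mu/sigma)^2 * eps^2."""
    return (mu / sigma) ** 2 * eps ** 2


def seg_err_bounds(mu: float, sigma: float, eps: float) -> Tuple[float, float]:
    """Lower and upper bounds on E[SegErr_i] stated in Theorem 1."""
    ratio = mu / sigma
    lower = ratio * eps ** 2 / math.sqrt(math.pi)
    upper = UPPER_CONST * ratio ** 2 * eps ** 3
    return lower, upper


mu, sigma, eps = 10.0, 5.0, 32.0   # hypothetical key-gap statistics and eps
lower, upper = seg_err_bounds(mu, sigma, eps)
trivial = eps * expected_segment_length(mu, sigma, eps)
print(f"constant = {UPPER_CONST:.4f}")
print(f"lower = {lower:.1f}, upper = {upper:.1f}, trivial = {trivial:.1f}")
```

The ratio `upper / trivial` equals the constant exactly, which is the sense in which the derived bound improves on assuming every point attains the maximum error ϵ.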

3.4. ϵ-LEARNER

Given an ϵ, we have obtained the closed-form bounds of SegErr in Theorem 1, and both the upper and lower bounds have the form $w_1 (\mu/\sigma)^{w_2} \epsilon^{w_3}$, where $w_1, w_2, w_3$ are coefficients. As the concrete values of these coefficients can differ across datasets and methods, we propose to learn the following trainable estimator to make the error prediction more precise:

$$\mathrm{SegErr} = f(\epsilon, \mu, \sigma) = w_1 \left(\frac{\mu}{\sigma}\right)^{w_2} \epsilon^{w_3}, \quad \text{s.t.} \quad \frac{1}{\sqrt{\pi}} \le w_1 \le \frac{2}{3}\sqrt{\frac{2}{\pi}}\left(\frac{5}{3}\right)^{\frac{3}{4}}, \quad 1 \le w_2 \le 2, \quad 2 \le w_3 \le 3. \tag{2}$$

With this learnable estimator, we feed the data characteristic µ/σ of the look-ahead data and the transformed SegErr into it and find a suitable $\epsilon^*$ as $\left(\mathrm{SegErr} / \left(w_1 (\mu/\sigma)^{w_2}\right)\right)^{1/w_3}$. We discuss the look-ahead data and the transformed SegErr in the following paragraphs.

Now let's discuss why this adjustment can achieve better index performance. The ϵ-learner proactively plans the allocation of the total prediction error indicated by the user (i.e., $ε \cdot |D|$) and calculates the tolerated SegErr for the next segment. By adjusting the current ϵ to $\epsilon^*$, the following learned segment can fully utilize the distribution information of the data and achieve better performance in terms of the N-MAE trade-off. To be specific, when µ/σ is large, the local data has clear linearity, and thus we can adjust ϵ to a relatively small value to gain precise predictions; although the number of data points covered by this segment may decrease and the total number of segments then increases, the cost paid in terms of space is not larger than the benefit gained in terms of precise predictions. Similarly, when µ/σ is small, ϵ should be adjusted to a relatively large value to lower the learning difficulty and absorb some non-linear data localities; in this case, we gain in terms of space while paying some cost in terms of prediction accuracy.
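The closed-form inversion of the estimator in Eq. (2) can be sketched in a few lines. The function names and the coefficient values below are illustrative assumptions (the coefficients merely lie inside the constraint box of Eq. (2)); they are not learned values from the paper.

```python
def estimate_seg_err(eps: float, mu: float, sigma: float,
                     w1: float, w2: float, w3: float) -> float:
    """Estimator f(eps, mu, sigma) = w1 * (mu/sigma)^w2 * eps^w3."""
    return w1 * (mu / sigma) ** w2 * eps ** w3


def choose_eps(target_seg_err: float, mu: float, sigma: float,
               w1: float, w2: float, w3: float) -> float:
    """Invert the estimator: eps* = (SegErr / (w1 * (mu/sigma)^w2))^(1/w3)."""
    return (target_seg_err / (w1 * (mu / sigma) ** w2)) ** (1.0 / w3)


# Hypothetical coefficients inside the box constraints of Eq. (2).
w1, w2, w3 = 0.6, 1.5, 2.5

# Target error obtained from a user-expected eps of 32 at mu/sigma = 2.
target = estimate_seg_err(32.0, 10.0, 5.0, w1, w2, w3)

eps_linear = choose_eps(target, mu=10.0, sigma=2.0, w1=w1, w2=w2, w3=w3)   # mu/sigma = 5
eps_noisy = choose_eps(target, mu=10.0, sigma=10.0, w1=w1, w2=w2, w3=w3)   # mu/sigma = 1
print(eps_linear, eps_noisy)
```

For the same target SegErr, the more linear locality (larger µ/σ) receives the smaller ϵ and the noisier one receives the larger ϵ, matching the qualitative behaviour described above.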
The segment-wise adjustment of ϵ improves the overall index performance by continually and data-dependently balancing the cost of space and preciseness.

Look-ahead Data. To make the training and inference of the ϵ-learner light-weight, we propose to look ahead at a small amount of data $D'$ to reflect the characteristics of the following data localities. Specifically, we leverage a small subset $D' \subset D \setminus \bigcup_{j<i} D_j$ to estimate the value µ/σ for the following data. In practice, we set the size of $D'$ to 404 when learning the first segment as initialization, and to $\frac{1}{i-1} \sum_{j=1}^{i-1} \mathrm{Len}(D_j) \cdot \rho$ for the following segments. Here ρ is a pre-defined parameter indicating the percentage relative to the average number of covered keys of the learned segments, considering that the distribution of µ/σ can differ considerably across datasets. As for the first segment, according to the literature (Kelley, 2007), the sample size 404 can provide a 90% confidence interval for a coefficient of variation σ/µ ≤ 0.2.

SegErr and Optimization. As aforementioned, taking the user-expected ε as input, we aim to reflect the impact of ε with a transformed proxy quantity SegErr such that the ϵ-learner can choose a suitable $\epsilon^*$ to meet the user's preference while achieving a better N-MAE trade-off. Specifically, we make the value of SegErr updatable, and update it to $\mathrm{SegErr} = w_1 (\mu/\sigma)^{w_2} ε^{w_3}$ once a new segment is learned, where µ/σ here is the mean value over all the data processed so far. This strategy enables us to promptly incorporate both the user preference and the data distribution into the calculation of SegErr. As for the optimization of the light-weight model, i.e., $f(\epsilon, \mu, \sigma)$, which contains only three learnable parameters $w_1, w_2, w_3$, we adopt projected gradient descent (den Hertog & Roos, 1991) with the parameter constraints in Eq. (2). In this way, we only need to track a few statistics and learn the ϵ estimator in an efficient one-pass manner. The overall algorithm is summarized in Appx. G.
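The two mechanics described above, the look-ahead size rule and a projected gradient step on $w_1, w_2, w_3$, might look as follows. Several details are assumptions made only for this sketch and are not specified in this section: ρ is treated as a fraction, the look-ahead size is rounded and floored at 1, the loss is the squared error between the logarithms of the predicted and observed SegErr, and the learning rate is arbitrary. Only the box constraints and the initial size 404 come from the text.

```python
import math
from typing import Sequence, Tuple

# Box constraints on (w1, w2, w3) from Eq. (2).
W_LOW = (1.0 / math.sqrt(math.pi), 1.0, 2.0)
W_HIGH = ((2.0 / 3.0) * math.sqrt(2.0 / math.pi) * (5.0 / 3.0) ** 0.75, 2.0, 3.0)


def look_ahead_size(prev_lengths: Sequence[int], rho: float,
                    initial: int = 404) -> int:
    """Size of D': 404 for the first segment, else rho * mean(Len(D_j))."""
    if not prev_lengths:
        return initial
    mean_len = sum(prev_lengths) / len(prev_lengths)
    return max(1, int(round(mean_len * rho)))  # rounding/floor are assumptions


def projected_gd_step(w: Tuple[float, float, float], eps: float, mu: float,
                      sigma: float, observed_seg_err: float,
                      lr: float = 0.01) -> Tuple[float, ...]:
    """One gradient step on 0.5 * (log f - log SegErr_i)^2, then projection.

    The log-space loss is an assumed choice; the projection simply clips each
    coefficient back into its interval from Eq. (2).
    """
    w1, w2, w3 = w
    ratio = mu / sigma
    pred_log = math.log(w1) + w2 * math.log(ratio) + w3 * math.log(eps)
    residual = pred_log - math.log(observed_seg_err)
    grad = (residual / w1, residual * math.log(ratio), residual * math.log(eps))
    stepped = [wi - lr * gi for wi, gi in zip(w, grad)]
    return tuple(min(max(wi, lo), hi) for wi, lo, hi in zip(stepped, W_LOW, W_HIGH))


# Hypothetical update after a segment whose observed error was underestimated.
w_new = projected_gd_step((0.6, 1.5, 2.5), eps=32.0, mu=10.0, sigma=5.0,
                          observed_seg_err=1e9)
print(look_ahead_size([], rho=0.4), look_ahead_size([100, 300], rho=0.5), w_new)
```

Because the feasible set is an axis-aligned box, the projection is a per-coordinate clip, which keeps each update constant-time and compatible with the one-pass requirement.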

4.1. EXPERIMENTAL SETTINGS

Baselines and Metrics. We apply our framework to several SOTA learned index methods, including MET (Ferragina et al., 2020), FITing-Tree (Galakatos et al., 2019), Radix-Spline (Kipf et al., 2020), and PGM (Ferragina & Vinciguerra, 2020b). For evaluation, we consider the index performance in terms of the number of learned segments N, the size in bytes, the prediction preciseness MAE, and the total querying time in ns. For a quantitative comparison of the trade-off improvements, we calculate the Area Under the N-MAE Curve (AUNEC), where the x-axis and y-axis indicate N and MAE respectively. For the AUNEC metric, smaller is better. More introduction and implementation details are in Appx. H.

Datasets. We use several widely adopted datasets that come from real-world applications and differ in data scales and distributions (Kraska et al., 2018; Galakatos et al., 2019; Ding et al., 2020; Ferragina & Vinciguerra, 2020b; Li et al., 2021b), including Weblogs and IoT (timestamp keys), Map (location coordinate keys), and Lognormal (synthetic keys). More details and visualizations are in Appx. I.
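As an illustration of the AUNEC metric, the sketch below integrates an N-MAE curve with the trapezoidal rule. The integration rule is an assumption made here (the exact procedure is deferred to Appx. H), and the two curves are made-up toy numbers, not experimental results.

```python
from typing import Iterable, Tuple


def aunec(points: Iterable[Tuple[float, float]]) -> float:
    """Area under an N-MAE curve given (N, MAE) points, trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (n0, m0), (n1, m1) in zip(pts, pts[1:]):
        area += 0.5 * (m0 + m1) * (n1 - n0)
    return area


def relative_change(dynamic: float, fixed: float) -> float:
    """Percentage change of AUNEC; negative means the dynamic version is better."""
    return 100.0 * (dynamic - fixed) / fixed


# Made-up toy curves over the same range of N.
fixed_curve = [(100, 20.0), (200, 10.0), (400, 5.0)]
dynamic_curve = [(100, 16.0), (200, 8.0), (400, 4.0)]
change = relative_change(aunec(dynamic_curve), aunec(fixed_curve))
print(f"AUNEC change: {change:.2f}%")
```

Comparing areas only makes sense when both curves span the same range of N, which is why the toy curves share their N values.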

4.2. OVERALL INDEX PERFORMANCE

N-MAE Trade-off Improvements. In Table 1, we summarize the AUNEC improvements in percentage brought by the proposed framework for all the baseline methods on all the datasets. We also illustrate the N-MAE trade-off curves for some cases in Figure 2, where the blue curves indicate the results achieved by the fixed ϵ versions and the red curves those of the dynamic ϵ versions. Other baselines and datasets yield similar curves, which we include in Appx. J.1 due to the space limitation. These results show that the dynamic ϵ versions of all the baseline methods achieve a much better N-MAE trade-off (-15.66% to -22.61% averaged improvements, as smaller AUNEC indicates better performance), demonstrating the effectiveness and wide applicability of the proposed framework. As discussed in previous sections, datasets usually have diverse key distributions at different data localities, and the proposed framework can data-dependently adjust ϵ to fully utilize the distribution information of data localities and thus achieve better index performance in terms of the N-MAE trade-off. The Map dataset has significant non-linearity caused by spatial characteristics and is hard to fit with linear segments (all baseline methods learn linear segments), so relatively small improvements are achieved on it.

To examine whether the performance improvements w.r.t. the N-MAE trade-off (i.e., Table 1) lead to better querying efficiency in real-world systems, we show the averaged total querying time per query and the actual learned index size in bytes for two scenarios in Figure 3. We also mark the 99th percentile (P99) latency as the right bar. We observe that the dynamic ϵ versions indeed gain faster average querying speed, since we improve both the term N and the term $|y - \hat{y}|$ via adaptive adjustment of ϵ.
Besides, we find that the dynamic version achieves comparable or even better P99 results than the static version, due to the fact that our method effectively adjusts ϵ based on the expected ε and data characteristic, making the {ϵ i } fluctuated within a moderate range and leading to good robustness. A similar conclusion can be drawn from other baselines and datasets, and we present their results in Appx. J.1. Another thing to note is that, this experiment also verifies the usability of our framework in which users can flexibly set the expected ε to meet various space-time preferences just as they set ϵ in the original learned index methods. Index Building Cost. Compared with the original learned index methods that adopt a fixed ϵ, we introduce extra computation to dynamically adjust ϵ in the index building stage. Does this affect the efficiency of the original methods? Here we report the relative increments of building times in Table 2 . From it, we can observe that the proposed dynamic ϵ framework achieves comparable building times to all the original learned index methods on all the datasets, showing the efficiency of our framework since it retains the online learning manner with the same complexity as the original methods (both in O(|D|)). Note that we only need to pay this extra cost once, i.e., building the index once, and then the index structures can accelerate the frequent data querying operations for real-world applications. We summarize the AUNEC changes in percentage compared to the proposed framework in Table 3 . Here we only report the results for FITing-Tree due to the space limitation and similar results can be observed for other methods. Recall that for AUNEC, the smaller, the better. From this table, we have the following observations: (1) The Random ϵ version achieves much worse results than the proposed dynamic ϵ framework, showing the necessity and effectiveness of learning the impact of ϵ. 
(2) The Polynomial Learner achieves better results than the Random ϵ version while still having a large performance gap compared to our proposed framework. This indicates the usefulness of the derived theoretical results that link the index performance, ϵ, and the data characteristics together. (3) The Least Square Learner achieves AUNEC results similar to those of the proposed framework. However, it has higher computational complexity and pays the cost of much larger building times, e.g., 14× and 53× longer building times on IoT and Map respectively. These results demonstrate the effectiveness and efficiency of the proposed framework that adjusts ϵ based on the theoretical results, which will be validated next.

We visualize part of the learned segments for FITing-Tree with fixed and dynamic ϵ on the IoT dataset in Figure 4, where N and $\sum \mathrm{SegErr}_i$ indicate the number of learned segments and the total prediction error for the shown segments respectively. The vector $\overrightarrow{\mu/\sigma}$ indicates the characteristics of the covered data $\{D_i\}$. We can see that our dynamic framework helps the learned index gain both smaller space (7 vs. 4 segments) and smaller total prediction error (48017 vs. 29854). Note that the ϵ values within $\overrightarrow{\epsilon_i}$ are diverse due to the diverse linearity of different data localities. For the data whose positions are within about [30000, 30600] and [34700, 35000], the proposed framework chooses large ϵ values as their µ/σ values are small, and by doing so, it achieves a smaller N than the fixed version by absorbing these non-linear localities. The data in the middle part have clear linearity with large µ/σ values, and thus the proposed framework adjusts ϵ to 19 and 10, which are smaller than 32, to achieve better precision. These experimental observations are consistent with our analysis in the paragraph under Eq. (2), and clearly confirm that the proposed framework adaptively adjusts ϵ based on data characteristics.

4.5. MORE EXPERIMENTS IN APPENDIX

Due to the space limitation, we provide further experiments and analysis in the Appendix, including:

• More results about the overall index performance (Appx. J.1) and ablation studies (Appx. J.2) on other datasets and methods, which support conclusions similar to those in Sec. 4.2 and Sec. 4.4.

• The theoretical validation (Appx. J.3) of Theorem 1, that $\mathrm{SegErr}_i$ is within the derived bounds, and of Theorem 2, that the learned slopes of various ϵ-bounded methods have the same trends and all concentrate on $1/\mu_i$ with a bounded difference. This shows that the baselines have the same mathematical forms as we derived, and that the proposed ϵ-learner works well with wide applicability.

• The insights about which kinds of datasets will benefit from our dynamic adjustment (Appx. J.4), with the help of an indicative quantity, the coefficient of variation (σ/µ).

5. CONCLUSIONS

Existing learned index methods introduce an important hyper-parameter ϵ to provide a worst-case preciseness guarantee and meet various space-time user preferences. In this paper, we provide formal analyses of the relationships among ϵ, local data characteristics, and the introduced quantity $\mathrm{SegErr}_i$ for each learned segment, which is the product of the number of covered keys and MAE, and thus embeds the space-time trade-off. Based on the derived bounds, we present a pluggable dynamic ϵ framework that leverages an ϵ-learner to data-dependently adjust ϵ and achieve better index performance in terms of the space-time trade-off. A series of experiments verify the effectiveness, efficiency, and usability of the proposed framework. We believe that our work contributes a deeper understanding of how ϵ impacts the index performance, and encourages the exploration of fine-grained trade-off adjustments that consider local data characteristics. Our study also opens several interesting directions for future work. For example, we can apply the proposed framework to other problems in which piece-wise approximation algorithms with fixed ϵ are used while still requiring a space-time trade-off, such as similarity search and lossy compression for time series data (Chen et al., 2007; Xie et al., 2014; Buragohain et al., 2007; O'Rourke, 1981).

data localities, as shown in our mathematical derivation linking the indexing performance, ϵ and data statistics µ/σ (Sec. 3.3) and the proposed ϵ-learner (Sec. 3.4). Besides, different from (Ferragina et al., 2020), which reveals the relationship between ϵ and index size based on MET, in Sec. 3.3 we give novel analyses of the impact of ϵ on not only index size, but also index preciseness and a comprehensive trade-off quantity, which facilitates the proposed dynamic ϵ adjustment.

A.2 DATA-LAYOUT-OPTIMIZATION BASED METHODS

In this paper, we mainly focus on the ϵ-based learned index methods.
In recent years, some data-layout-optimization based methods have also achieved promising indexing performance. For example, ALEX (Ding et al., 2020) improves the Recursive Model Index (RMI) (Kraska et al., 2018) by reserving gaps within arrays to better support index updates (insert, delete, update, etc.). LIPP (Wu et al., 2021) extends the tree structure to achieve zero prediction error under update operations and proposes an adjustment strategy that bounds the height of the tree index. NFL (Wu et al., 2022) transforms a complex key distribution into a near-uniform distribution. CARMI (Zhang & Gao, 2022) leverages an entropy-based cost model to improve the data partitioning of tree nodes in learned indexes. FINEdex (Li et al., 2021a) focuses on concurrent and independent model processing with the help of a flattened data structure. These two performance optimization perspectives are relatively orthogonal. Both types of learned index methods have their pros and cons:

Worst-case guarantee. For the ϵ-bounded methods (Ferragina & Vinciguerra, 2020b; Stoian et al., 2021; Galakatos et al., 2019; Ferragina et al., 2020; Kipf et al., 2020), the most important advantage is the worst-case guarantee in each segment. This property is valuable in many realistic indexing applications such as financial databases and on-device intelligence. Also note that our approach maintains 99th percentile (P99) performance comparable to the static ϵ baselines, as shown in the overall index performance comparison (Figure 3 and Figure 10).

Index Size. Data-layout-based approaches such as ALEX (Ding et al., 2020) do achieve better indexing performance in dynamic scenarios, but they pay a larger index size overhead because of the gap insertion technique (reserving empty space for possibly inserted data).
Empirically, we examine the performance of ALEX with our experimental settings and find that it gains comparable query time and larger index size than the learned index with dynamic ϵ as the following Table 4 shows, where the q time and index size are in ns and KB respectively. We adopt the default hyper-parameters of ALEX and the ϵ is 4, 32, 64, and 32 in these four datasets respectively for RadixSpline. We note that the index size is still very important even with a large amount of memory available today, as shown by the fact that almost all related learned index works examine this metric in their experiments. It is worth noting that index size can have a significant impact on latency indirectly, since the smaller the index, the more it can fit into the hierarchical high-speed CPU caches, and thus increase the hit rates to speed up the overall indexing performance Zhang & Gao (2022); Zhang et al. (2021) . Moreover, in some practical applications, the available memory can be relatively small such as in cases where users need to build indexes from multiple keys of the data, and need to use the index on IoT devices for edge computation. Dynamic Hyper-parameter. Moreover, our novel framework can be regarded as an automatic method that determines hyper-parameters (ϵ) according to the varied local properties of the data. Although the data layout optimization goes beyond our scope, our insight can also contribute to the ALEX approach, since it also introduces hyper-parameters in local linear segments learning, the lower and upper density limits on each gapped array: d l , d u ∈ (0, 1]. The authors empirically set them to be 0.6 and 0.8 respectively. However, different gapped arrays may gain better performance with different density limitations, since the data distribution can be varied across different localities, as we have shown in experiments.

B CONNECTING PREDICTION ERROR WITH SEARCHING STRATEGY

As we mentioned in Section 3.1, we can find the true position of the queried data point in $O(\log(N) + \log(|\hat{y} - y|))$, where $N$ is the number of learned segments and $|\hat{y} - y|$ is the absolute prediction error. A binary search or exponential search finds the stored true position $y$ based on $\hat{y}$. It is worth pointing out that the search range corresponds to the absolute prediction error. In this paper, we decouple the quantity $\mathrm{SegErr}_i$ as the product of $\mathrm{Len}(D_i)$ and $\mathrm{MAE}(D_i|S_i)$ in the derivation of Theorem 1. Built upon the theoretical analysis, we adopt exponential search in experiments to better leverage the predictive models.

To clarify, consider a learned segment $S_i$ with its covered data $D_i$. Let $|\hat{y}_k - y_k|$ be the absolute prediction error of the $k$-th data point covered by this segment, and $\epsilon_i$ be the maximum absolute prediction error of $S_i$, i.e., $|\hat{y}_k - y_k| \le \epsilon_i$ for all $k \in [\mathrm{len}(D_i)]$.

• The binary search is conducted within the searching range $[\hat{y}_k \pm \epsilon_i]$ for each data point, thus the mean search range is $\frac{1}{\mathrm{len}(D_i)} \sum_{k=1}^{\mathrm{len}(D_i)} 2\epsilon_i = O(\epsilon_i)$, which is independent of the preciseness of the learned segment and is an upper bound of $\mathrm{MAE}(D_i|S_i)$.

• The exponential search first finds the searching range where the queried data may exist by centering around $\hat{y}$, repeatedly doubling the range $[\hat{y} \pm 2^q]$ where the integer $q$ grows from 0, and comparing the queried data with the data points at positions $\hat{y} \pm 2^q$. After finding the specific range such that a $q_k$ satisfies $2^{\log(q_k)-1} \le |\hat{y}_k - y_k| \le 2^{\lceil \log(q_k) \rceil}$ for the $k$-th data point, a binary search is conducted to find the exact location. In this way, the mean search range is $\frac{1}{\mathrm{len}(D_i)} \sum_{k=1}^{\mathrm{len}(D_i)} 2^{\lceil \log(q_k) \rceil + 1} = O(\mathrm{MAE}(D_i|S_i))$, which can be much smaller than $O(\epsilon_i)$, especially for strong predictive models and datasets with clear linearity.
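The exponential search described above can be sketched as follows: the search window around the predicted position doubles until it brackets the target, and a binary search then finishes inside the bracket. This is a minimal illustration on a sorted Python list; the function name and the toy keys are assumptions, and a real index would search within the stored key array of the covering segment.

```python
from bisect import bisect_left
from typing import Sequence


def exponential_search(keys: Sequence[int], target: int, predicted: int) -> int:
    """Return the index of `target` in sorted `keys`, starting from `predicted`.

    The number of probes grows with log(|predicted - true index|), not with a
    fixed error bound. Returns -1 if `target` is absent.
    """
    n = len(keys)
    if n == 0:
        return -1
    pos = min(max(predicted, 0), n - 1)   # clamp the model's prediction
    bound = 1
    if keys[pos] < target:
        # True position lies to the right: double the step until we overshoot.
        while pos + bound < n and keys[pos + bound] < target:
            bound *= 2
        lo = pos + bound // 2
        hi = min(pos + bound, n - 1)
    else:
        # True position lies at or to the left of the prediction.
        while pos - bound >= 0 and keys[pos - bound] > target:
            bound *= 2
        lo = max(pos - bound, 0)
        hi = pos - bound // 2
    i = bisect_left(keys, target, lo, hi + 1)  # binary search in the bracket
    return i if i < n and keys[i] == target else -1


keys = list(range(0, 200, 2))             # hypothetical sorted keys 0, 2, ..., 198
print(exponential_search(keys, 100, predicted=47))   # prediction off by 3
print(exponential_search(keys, 3, predicted=1))      # absent key
```

With an accurate prediction the bracket is only a few positions wide, whereas a binary search over $[\hat{y} \pm \epsilon_i]$ always pays for the full width $2\epsilon_i$.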

C NOTATIONS

We summarize the adopted notations in Table 5 for convenience.

Notation         Description
S_i              The i-th segment of learned index
D                The whole data to be indexed
D_i              The data covered by the i-th segment
ε                Expected ϵ given by user
ϵ_i              The maximum prediction error of segment i
SegErr_i         The sum of the errors for the data covered by the i-th segment
SegErr           Expected segment error calculated based on ε
D′               Sampled data to reflect the characteristics of the following data localities
ρ                The hyper-parameter to determine the size of D′
μ                Mean of data intervals in D′
σ                Standard deviation of data intervals in D′
f(•)             The learnable function for determining ϵ_i
w_1, w_2, w_3    Learnable parameters of the ϵ-learner (f)

D INHERITING THE ABILITIES OF EXISTING WORKS

In this section, we discuss the benefits that the proposed framework gains from its pluggable property, using two example scenarios: dynamic data updates and a hard limit on the user-required index size.

We note that the data insert operation has been discussed in the baselines FITing-Tree (Galakatos et al., 2019) and PGM (Ferragina & Vinciguerra, 2020b). Importantly, their insertion solutions rely on their respective ϵ-bounded piece-wise segmentation algorithms, so the proposed framework is still valid when using their respective solutions to handle data insertion. Specifically, FITing-Tree introduces a buffer for each learned segment to store the inserted keys; when the buffer is full, the data covered by the segment is re-segmented (see Section 5 in Galakatos et al. (2017)). PGM adopts a logarithmic method (O'Neil et al., 1996; Overmars, 1987) that maintains a series of sorted sets {S_0, S_1, ..., S_b} where b = Θ(log(|D|)), and builds multiple PGM-INDEX models over the sets. When a key x is inserted, a new PGM-INDEX is built over the merged sets (see Section 3 in Ferragina & Vinciguerra (2020b)). In general, the insertion solutions of these existing methods re-index a piece of data together with the inserted data, and the re-indexing process is the same as the original piece-wise linear segmentation process, only applied to different data. Therefore, we can still apply the proposed dynamic-ϵ framework to these methods in insertion scenarios: we adjust ϵ and learn the index according to the new data to be re-indexed.

For the hard size limitation case, we observe that the existing work PGM introduced a multi-criteria variant that auto-tunes itself under pre-defined hard size requirements from users. Our proposed framework is pluggable and still valid when using the PGM variant to handle the size requirement.
Specifically, given a space constraint, the multi-criteria PGM iteratively estimates the relationship between ϵ and size with a learnable function size(ϵ) = aϵ^{−b}, and automatically outputs the index that minimizes its query time via different estimated ϵs. Given a size requirement, we can do the same in the dynamic ϵ setting by setting our ε to the ϵ estimated by the original PGM method.

The discussion above shows how the proposed framework can inherit the ability of existing works to handle updates. To show how the learned index with dynamic ϵ performs under different dynamic scenarios, as an example, we simply replace the static ϵ parameter in the segment-building process of the updatable PGM with a dynamic one predicted by the proposed ϵ-learner, and keep the update processing of PGM unchanged. We follow read-heavy and write-heavy workload settings similar to ALEX (Ding et al., 2020), where the workloads are composed of lookup operations and insert operations with different ratios. Specifically, after building learned indexes on a random subset of the whole dataset with percentage R_init ∈ (0, 100%), we repeatedly perform N_lookup random lookup operations and 1 insertion operation in a batch. To simulate different workloads, we set R_init to each of [5%, 10%, 20%, 40%, 60%, 80%] and N_lookup to each of [1, 2, 4, 8, 16]. We repeat the experiments 3 times and summarize the total throughput (operations/sec) and index size (KB) in Figure 6. Compared with the PGM with static ϵ, the PGM with ϵ adjustment achieves larger total throughput and smaller sizes in most cases, indicating the effectiveness of the proposed method. We also find that with larger ratios of writes (i.e., smaller N_lookup) and smaller ratios of initial building data (R_init), the throughput improvements are smaller and the variances are larger.
Note that the current modification with the ϵ-learner mostly impacts the index building stage, and we do not design specific strategies in the insert stage to estimate the data's local characteristics more precisely, which is a promising future direction to further optimize the index performance in update cases.
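As an illustration of the size model size(ϵ) = aϵ^{−b} mentioned above, the following sketch fits a and b by ordinary least squares in log-log space and inverts the model for a given size budget. This is only one plausible way to fit such a function and is not claimed to be the procedure used by the multi-criteria PGM; the helper names `fit_power_law` and `eps_for_size` are hypothetical.

```python
import math

def fit_power_law(eps_values, sizes):
    """Fit size = a * eps**(-b) via least squares on log(size) vs. log(eps)."""
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -b
    a = math.exp(mean_y - slope * mean_x)  # intercept gives log(a)
    return a, -slope

def eps_for_size(a, b, size_budget):
    """Invert size = a * eps**(-b) to get the eps that meets a size budget."""
    return (a / size_budget) ** (1.0 / b)
```

Under the approach described above, the value returned by `eps_for_size` would play the role of the expected ε that is passed to the dynamic ϵ framework.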

E PROOF OF THEOREM 1

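Given a learned segment $S_i: y = a_i x + b_i$, denote $c_i$ as the stored position of the last covered data for the $(i-1)$-th segment ($c_i$ takes its initial value for the first segment). We can write the expectation of $SegErr_i$ as

\[ E[SegErr_i] = E\Big[ \sum_{j=0}^{j^*-1} \big| a_i X_j + b_i - (j + c_i + 1) \big| \Big], \]

where $j^*$ indicates the length of the segment, and $X_j$ indicates the $j$-th key covered by the segment $S_i$. As studied in Ferragina et al. (2020), the linear-approximation problem with ϵ guarantee can be modeled as random walk processes. Specifically, $X_j = X_0 + \sum_{k=1}^{j} G_k$ (for $j \in \mathbb{Z}_{>0}$), where $G_k$ is the key increment variable whose mean and variance are $\mu$ and $\sigma^2$ respectively. Denote $Z_j = X_j - j/a_i + (b_i - c_i - 1)/a_i$ as the $j$-th position of the transformed random walk $\{Z_j\}_{j \in \mathbb{N}}$, and $j^* = \max\{ j \in \mathbb{N} \mid -\epsilon/a_i \le Z_j \le \epsilon/a_i \}$ as the random variable indicating the maximal position when the random walk is within the strip of boundary $\pm\epsilon/a_i$. The expectation can be rewritten as:

\[ E\Big[ \sum_{j=0}^{j^*-1} \big| a_i X_j - j + (b_i - c_i - 1) \big| \Big] = a_i E\Big[ \sum_{j=0}^{j^*-1} |Z_j| \Big] = a_i \sum_{n=1}^{\infty} E\Big[ \sum_{j=0}^{n-1} |Z_j| \Big] \Pr(j^* = n). \tag{3} \]

The last equality in Eq. (3) is due to the definition of expectation. Following the MET algorithm, in which $S_i$ goes through the point $(X_0, Y_0 = c_i + 1)$, we get $b_i = -a_i X_0 + c_i + 1$ and we can rewrite $Z_j$ in the following form:

\[ Z_0 = 0, \qquad Z_j = X_j - X_0 - j/a_i = \sum_{k=1}^{j} G_k - j/a_i = \sum_{k=1}^{j} (G_k - 1/a_i) = \sum_{k=1}^{j} W_k \quad (j > 0), \]

where $W_k$ is the walk increment variable of $Z_j$, with $E[W_k] = \mu - 1/a_i$ and $Var[W_k] = \sigma^2$. Under the MET algorithm setting where $a_i = 1/\mu$ and $\epsilon \gg \sigma/\mu$, the transformed random walk $\{Z_j\}$ has increments with zero mean and variance $\sigma^2$, and many steps are necessary to reach the random walk boundary. With the Central Limit Theorem, we can assume that $Z_j$ follows the normal distribution with mean $\mu_{z_j}$ and variance $\sigma_{z_j}^2$, and thus $|Z_j|$ follows the folded normal distribution:

\[ Z_j \sim \mathcal{N}\big( (\mu - 1/a_i) j, \; j\sigma^2 \big), \qquad E(|Z_j|) = \mu_{z_j} \big[ 1 - 2\Phi(-\mu_{z_j}/\sigma_{z_j}) \big] + \sigma_{z_j} \sqrt{2/\pi} \, \exp\big( -\mu_{z_j}^2 / 2\sigma_{z_j}^2 \big), \]

where $\Phi$ is the normal cumulative distribution function.

For the MET algorithm, $a_i = 1/\mu$ and thus $\mu_{z_j} = 0$, $\sigma_{z_j} = \sigma\sqrt{j}$, and $E(|Z_j|) = \sqrt{2/\pi} \, \sigma \sqrt{j}$. Then Eq. (3) can be written as

\[ \frac{1}{\mu} \sum_{n=1}^{\infty} E\Big[ \sum_{j=0}^{n-1} |Z_j| \Big] \Pr(j^* = n) < \frac{1}{\mu} \sum_{n=1}^{\infty} \sum_{j=0}^{n-1} E\big[ |Z_j| \big] \Pr(j^* = n) = \frac{\sigma}{\mu} \sqrt{\frac{2}{\pi}} \sum_{n=1}^{\infty} \sum_{j=0}^{n-1} \sqrt{j} \, \Pr(j^* = n). \tag{4} \]

For the inner sum term in Eq. (4), we have $\sum_{j=0}^{n-1} \sqrt{j} < \frac{2}{3} n\sqrt{n}$ since $\sum_{j=0}^{n-1} \sqrt{j} < \int_0^n \sqrt{x} \, dx = \frac{2}{3} n\sqrt{n}$. The result in Eq. (4) then becomes

\[ E[SegErr_i] < \frac{2}{3} \sqrt{\frac{2}{\pi}} \frac{\sigma}{\mu} \sum_{n=1}^{\infty} n\sqrt{n} \, \Pr(j^* = n) = \frac{2}{3} \sqrt{\frac{2}{\pi}} \frac{\sigma}{\mu} E\big[ (j^*)^{3/2} \big] = \frac{2}{3} \sqrt{\frac{2}{\pi}} \frac{\sigma}{\mu} E\Big[ \big( (j^*)^2 \big)^{3/4} \Big] \le \frac{2}{3} \sqrt{\frac{2}{\pi}} \frac{\sigma}{\mu} \Big( E\big[ (j^*)^2 \big] \Big)^{3/4}, \]

where the last inequality holds due to the Jensen inequality $E[X^{3/4}] \le (E[X])^{3/4}$. Using $E[j^*] = \frac{\mu^2}{\sigma^2} \epsilon^2$ and $Var[j^*] = \frac{2}{3} \frac{\mu^4}{\sigma^4} \epsilon^4$ derived for the MET algorithm (Ferragina et al., 2020), we get $E[(j^*)^2] = Var[j^*] + (E[j^*])^2 = \frac{5}{3} \frac{\mu^4}{\sigma^4} \epsilon^4$, which yields the following upper bound:

\[ E[SegErr_i] < \frac{2}{3} \sqrt{\frac{2}{\pi}} \Big( \frac{5}{3} \Big)^{3/4} \Big( \frac{\mu}{\sigma} \Big)^2 \epsilon^3. \]

For the lower bound, applying the triangle inequality to Eq. (3), we have

\[ \frac{1}{\mu} \sum_{n=1}^{\infty} E\Big[ \sum_{j=0}^{n-1} |Z_j| \Big] \Pr(j^* = n) > \frac{1}{\mu} \sum_{n=1}^{\infty} E\Big[ \Big| \sum_{j=0}^{n-1} Z_j \Big| \Big] \Pr(j^* = n) = \frac{1}{\mu} \sum_{n=1}^{\infty} E\big[ |Z| \big] \Pr(j^* = n), \tag{5} \]

where $Z = \sum_{j=0}^{n-1} Z_j$. Since $Z_j \sim \mathcal{N}(0, \sigma_{z_j}^2)$, $Z$ follows the normal distribution

\[ Z \sim \mathcal{N}\Big( \mu_Z = 0, \; \sigma_Z^2 = \sum_{j=0}^{n-1} \sigma_{z_j}^2 + \sum_{j=0}^{n-1} \sum_{k=0, k \ne j}^{n-1} r_{jk} \, \sigma_{z_j} \sigma_{z_k} \Big), \]

where $r_{jk}$ is the correlation between $Z_j$ and $Z_k$. Since $\mu_Z = 0$, $|Z|$ follows the folded normal distribution with $E[|Z|] = \sigma_Z \sqrt{2/\pi}$. Since the random walk $\{Z_j\}$ is a process with i.i.d. increments, the correlation $r_{jk} \ge 0$. With $\sigma_{z_j} = \sigma\sqrt{j} > 0$ and $r_{jk} \ge 0$, we have

\[ E[|Z|] > \sqrt{ \frac{2}{\pi} \sum_{j=0}^{n-1} \sigma_{z_j}^2 } = \sigma \sqrt{ n(n-1)/\pi } > \frac{\sigma (n-1)}{\sqrt{\pi}}, \]

and the result in Eq. (5) becomes:

\[ E[SegErr_i] > \frac{1}{\mu} \sum_{n=1}^{\infty} E\Big[ \Big| \sum_{j=0}^{n-1} Z_j \Big| \Big] \Pr(j^* = n) > \frac{\sigma}{\mu} \sqrt{\frac{1}{\pi}} \sum_{n=1}^{\infty} (n-1) \Pr(j^* = n) = \frac{\sigma}{\mu} \sqrt{\frac{1}{\pi}} \, E[j^* - 1] = \sqrt{\frac{1}{\pi}} \Big( \frac{\mu}{\sigma} \epsilon^2 - \frac{\sigma}{\mu} \Big). \]

Since $\epsilon \gg \frac{\sigma}{\mu}$, we can omit the right term $\sqrt{\frac{1}{\pi}} \frac{\sigma}{\mu}$ and finish the proof.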
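The two bounds can be sanity-checked numerically with a small Monte Carlo sketch of the random walk used in the proof. The sketch makes several assumptions that are not part of the original analysis: key gaps are drawn from a uniform distribution with mean µ and standard deviation σ, a segment ends at the first step where |Z_j| leaves the strip ±ϵ/a_i, and the function names `simulate_segerr` and `theorem1_bounds` are hypothetical.

```python
import math
import random

def simulate_segerr(mu, sigma, eps, trials, seed=0):
    """Monte Carlo estimate of E[SegErr_i] for a MET-style segment (a_i = 1/mu)."""
    rng = random.Random(seed)
    a = 1.0 / mu
    boundary = eps / a
    # Uniform gaps on [mu - h, mu + h] have standard deviation h / sqrt(3).
    half_width = sigma * math.sqrt(3.0)
    total = 0.0
    for _ in range(trials):
        z = 0.0          # Z_0 = 0
        seg_err = 0.0
        while abs(z) <= boundary:
            seg_err += a * abs(z)          # prediction error of the current key
            gap = rng.uniform(mu - half_width, mu + half_width)
            z += gap - 1.0 / a             # walk increment W_k = G_k - 1/a_i
        total += seg_err
    return total / trials

def theorem1_bounds(mu, sigma, eps):
    """Lower and upper bounds on E[SegErr_i] as derived in this appendix."""
    lower = math.sqrt(1.0 / math.pi) * (mu / sigma) * eps ** 2
    upper = (2.0 / 3.0) * math.sqrt(2.0 / math.pi) * (5.0 / 3.0) ** 0.75 \
        * (mu / sigma) ** 2 * eps ** 3
    return lower, upper
```

For example, `simulate_segerr(1.0, 0.5, 16.0, trials=300)` can be compared against `theorem1_bounds(1.0, 0.5, 16.0)`; since the derivation relies on ϵ ≫ σ/µ, the comparison is only meaningful in that regime.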

F LEARNED SLOPES OF OTHER ϵ-BOUNDED METHODS

As shown in Theorem 1, we know how ϵ impacts the SegErr_i of each segment learned by the MET algorithm, where the theoretical derivations largely rely on the slope condition a_i = 1/µ. Here we prove that for other ϵ-bounded methods, the learned slope of each segment (i.e., a_i of S_i) concentrates on the reciprocal of the expected key interval, as shown in the following theorem.

Theorem 2. Given an ϵ ∈ Z_{>1} and an ϵ-bounded learned index algorithm A. For a linear segment S_i: y = a_i x + b_i learned by A, denote its covered data and the number of covered keys as D_i and Len(D_i) respectively. Assuming the expected key interval of D_i is µ_i, the learned slope a_i concentrates on ã = 1/µ_i with bounded relative difference:

\[ \Big( 1 - \frac{2\epsilon}{E[Len(D_i)] - 1} \Big) \tilde{a} \le E[a_i] \le \Big( 1 + \frac{2\epsilon}{E[Len(D_i)] - 1} \Big) \tilde{a}. \]

Proof. For the learned linear segment S_i, denote its first predicted position and last predicted position as y′_0 and y′_n respectively; its slope is $a_i = \frac{y'_n - y'_0}{x_n - x_0}$. Notice that y_0 − ϵ ≤ y′_0 ≤ y_0 + ϵ and y_n − ϵ ≤ y′_n ≤ y_n + ϵ due to the ϵ guarantee, so we have y_n − y_0 − 2ϵ ≤ y′_n − y′_0 ≤ y_n − y_0 + 2ϵ and the expectation of a_i can be bounded as

\[ E\Big[ \frac{y_n - y_0 - 2\epsilon}{x_n - x_0} \Big] \le E[a_i] = E\Big[ \frac{y'_n - y'_0}{x_n - x_0} \Big] \le E\Big[ \frac{y_n - y_0 + 2\epsilon}{x_n - x_0} \Big]. \]

Note that for any learned segment S_i whose first covered data is (x_0, y_0) and last covered data is (x_n, y_n), we have $E\big[ \frac{x_n - x_0}{y_n - y_0} \big] = \mu_i$ and thus the inequalities become

\[ \frac{1}{\mu_i} - E\Big[ \frac{2\epsilon}{x_n - x_0} \Big] \le E[a_i] \le \frac{1}{\mu_i} + E\Big[ \frac{2\epsilon}{x_n - x_0} \Big]. \]

Since ã = 1/µ_i and E[x_n − x_0] = (E[Len(D_i)] − 1)µ_i, we finish the proof.

Theorem 2 shows that the relative deviations between the learned slope a_i and ã are within 2ϵ/(E[Len(D_i)] − 1). For the MET and PGM learned index methods, we have the following corollary that gives more precise deviations without the expectation term E[Len(D_i)].

Corollary 2.1. For the MET method (Ferragina et al., 2020) and the optimal ϵ-bounded linear approximation method that learns the largest segment length, used in PGM (Ferragina & Vinciguerra, 2020b), the slope relative differences are O(1/ϵ).

Proof. We note that the segment length of a learned segment is O(ϵ²) for the MET algorithm, which is proved in Theorem 1 of Ferragina et al. (2020). Since PGM achieves the largest learned segment length, which is no smaller than that of the MET algorithm, we finish the proof.
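The step from Theorem 2 to the corollary can be spelled out as a short worked calculation. It assumes that, for the MET algorithm, the expected segment length is $E[Len(D_i)] = \frac{\mu^2}{\sigma^2}\epsilon^2$, i.e., the value of $E[j^*]$ used in Appx. E, where $\mu$ and $\sigma$ are the mean and standard deviation of the key intervals:

```latex
\frac{2\epsilon}{E[\mathrm{Len}(D_i)] - 1}
  = \frac{2\epsilon}{\frac{\mu^2}{\sigma^2}\epsilon^2 - 1}
  = \frac{2\sigma^2\epsilon}{\mu^2\epsilon^2 - \sigma^2}
  \approx \frac{2\sigma^2}{\mu^2\,\epsilon}
  = O(1/\epsilon)
  \qquad (\epsilon \gg \sigma/\mu).
```

For PGM, the segment length is at least as large, so the denominator can only grow and the same $O(1/\epsilon)$ rate applies.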

G THE ALGORITHM OF DYNAMIC ϵ ADJUSTMENT

We summarize the proposed algorithm below. Section 3.4 provides detailed descriptions of the initialization and adjustment sub-procedures; lookahead() and optimize() are described in the "Look-ahead Data" and "SegErr and Optimization" paragraphs respectively.

Algorithm: Dynamic ϵ Adjustment with Pluggable ϵ Learner
Input: D: data to be indexed; A: learned index algorithm; ε: expected ϵ; ρ: length percentage for look-ahead data
Output: S: learned segments with varied ϵs
 1: initial parameters w_{1,2,3} of the learned function: f(ϵ, µ, σ) = w_1 (µ/σ)^{w_2} ϵ^{w_3}
 2: initial mean length of learned segments so far: Len(D_S) ← 404
 3: S ← ∅, (μ/σ) ← lookahead(D, Len(D_S) · ρ)
 ...
 9: Learn new segment S_i using adjusted ϵ*:
10:   [S_i, D_i] ← A(D, ϵ*)
11:   S ← S ∪ S_i
12:   D ← D \ D_i, D_S ← D_S ∪ D_i
13: Online update Len(D_S):
14:   Len(D_S) ← running-mean(Len(D_S), Len(D_i))
15:   (μ/σ) ← running-mean((μ/σ), (µ/σ))
16: Train the learner with ground truth:
17:   w_{1,2,3} ← optimize(f, S_i, SegErr_i)
18:   SegErr ← w_1 (μ/σ)^{w_2} ε^{w_3}
19: until D = ∅

H IMPLEMENTATION DETAILS

Baselines. All the experiments are conducted on a Linux server with an Intel Xeon Platinum 8163 2.50GHz CPU. We first introduce more details and the implementation of the baseline learned index methods. MET (Ferragina et al., 2020) fixes the segment slope as the reciprocal of the expected key interval, and goes through the first available data point for each segment. FITing-Tree (Galakatos et al., 2019) adopts a greedy shrinking cone algorithm, and the learned segments are organized with a B+-tree. Here we use the stx::btree (v0.9) implementation (Bingmann, 2013) and set the filling factors of inner nodes and leaf nodes to 100%, i.e., we adopt the full-paged filling manner. Radix-Spline (Kipf et al., 2020) adopts a greedy spline interpolating algorithm to learn spline points, and the learned spline segments are organized with a flat radix table. We set the number of radix bits to r = 16 for the Radix-Spline method, which means that the radix table contains 2^16 entries. PGM (Ferragina & Vinciguerra, 2020b) adopts a convex hull based algorithm to achieve the minimum number of learned segments, and the segments can be organized with the help of binary search, CSS-Tree (Rao & Ross, 1999) or a recursive structure. Here we implement the recursive version since it beats the other two variants in terms of indexing performance. For all the baselines and our method, we adopt exponential search to better leverage the predictive models, since the query complexity using exponential search corresponds to the preciseness of the models (MAE), as we analyzed in Appendix B.

Evaluation Metrics. We evaluate the index performance in terms of its size, prediction preciseness, and querying time. Specifically, we report the number of learned segments N, the index size in bytes, the MAE as $\frac{1}{|D|} \sum_{(x,y) \in D} |y - S(x)|$, and the querying time per query in ns (i.e., we perform querying operations for all the indexed data, record the total time of getting the payloads given the keys, and report the time averaged over all the queries). For a quantitative comparison w.r.t. the trade-off improvements, we calculate the Area Under the N-MAE Curve (AUNEC), where the x-axis and y-axis indicate N and MAE respectively. For the AUNEC metric, the smaller, the better.

Hyper-parameters. We describe a few additional details of the proposed framework in terms of the ϵ-learner initialization and the hyper-parameter setting. The parameters w_{1,2,3} of the ϵ-learner shown in Eq. (2) are set by a light-weight initialization. We empirically found that this light-weight initialization leads to better index performance compared to the versions with random parameter initialization, and that it benefits the exploration of diverse ϵ*, i.e., it leads to a larger variance of the dynamic ϵ sequence [ϵ_1, ..., ϵ_i, ..., ϵ_N]. As for the hyper-parameter ρ (described in Section 3.4), we conduct a grid search over ρ ∈ {0.1, 0.4, 0.7, 1.0} on the Map and IoT datasets. We found that all the ρ values achieve a better N-MAE trade-off (i.e., smaller AUNEC results) than the fixed ϵ versions. Since the setting ρ = 0.4 achieves the best average results on the two datasets, we set ρ to 0.4 for the other datasets.
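To make the roles of ρ, the look-ahead statistics and the ϵ-learner concrete, the adjustment loop summarized in Appx. G can be sketched in Python as follows. This is a rough sketch under several loudly stated assumptions, not the authors' implementation: a greedy shrinking cone stands in for the pluggable algorithm A, keys are assumed strictly increasing, ϵ* is obtained by inverting f so that the predicted SegErr matches the expected one, optimize() is replaced by one gradient step on the squared log-error, and the initial parameters and the clipping range of ϵ* are hypothetical choices. All function names are hypothetical.

```python
import math

def lookahead_ratio(keys, start, n):
    """mu/sigma of the key gaps among the next n keys (None if undefined)."""
    window = keys[start:start + n + 1]
    gaps = [b - a for a, b in zip(window, window[1:])]
    if len(gaps) < 2:
        return None
    mu = sum(gaps) / len(gaps)
    sigma = math.sqrt(sum((g - mu) ** 2 for g in gaps) / len(gaps))
    return mu / sigma if sigma > 0 else None

def shrinking_cone(keys, start, eps):
    """Stand-in for A: greedy eps-bounded segment anchored at keys[start]."""
    x0, lo, hi, end = keys[start], 0.0, math.inf, start + 1
    while end < len(keys):
        dx, dy = keys[end] - x0, end - start
        if not lo <= dy / dx <= hi:
            break                                   # new point falls outside the cone
        lo, hi = max(lo, (dy - eps) / dx), min(hi, (dy + eps) / dx)
        end += 1
    slope = lo if math.isinf(hi) else (lo + hi) / 2.0
    return slope, end

def build_dynamic(keys, eps_expected, rho=0.4, lr=0.01):
    log_w1, w2, w3 = 0.0, 1.0, 2.0    # hypothetical initial learner parameters
    mean_len = 404.0                  # initial mean segment length from the listing
    mean_ratio = lookahead_ratio(keys, 0, max(2, int(mean_len * rho))) or 1.0
    segments, start = [], 0
    while start < len(keys):
        ratio = lookahead_ratio(keys, start, max(2, int(mean_len * rho))) or mean_ratio
        # Expected SegErr from the expected eps; invert f for this locality (assumption).
        log_target = log_w1 + w2 * math.log(mean_ratio) + w3 * math.log(eps_expected)
        eps_star = math.exp((log_target - log_w1 - w2 * math.log(ratio)) / w3)
        eps_star = min(max(eps_star, 1.0), 4.0 * eps_expected)   # hypothetical clipping
        slope, end = shrinking_cone(keys, start, eps_star)
        seg_err = sum(abs(slope * (keys[j] - keys[start]) - (j - start))
                      for j in range(start, end))
        segments.append((start, end, slope, eps_star))
        k = len(segments)
        mean_len += ((end - start) - mean_len) / k    # running means
        mean_ratio += (ratio - mean_ratio) / k
        if seg_err > 0:                               # one gradient step in log-space
            f1, f2 = math.log(ratio), math.log(eps_star)
            grad = log_w1 + w2 * f1 + w3 * f2 - math.log(seg_err)
            log_w1 -= lr * grad
            w2 -= lr * grad * f1
            w3 = max(0.5, w3 - lr * grad * f2)        # keep the inversion stable
        start = end
    return segments
```

Each returned tuple is `(start, end, slope, eps_star)`; by construction every key in `[start, end)` is predicted within `eps_star` of its position by the line through the segment's first key.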

I DATASET DETAILS

Our framework is verified on several widely adopted datasets having different data scales and distributions.

• Weblogs (Kraska et al., 2018; Galakatos et al., 2019; Ferragina & Vinciguerra, 2020b) contains about 715M log entries for the requests to a university web server, and the keys are log timestamps.
• IoT (Galakatos et al., 2019; Ferragina & Vinciguerra, 2020b) contains about 26M event entries from different IoT sensors in a building, and the keys are recording timestamps.
• Map (Kraska et al., 2018; Galakatos et al., 2019; Ding et al., 2020; Ferragina & Vinciguerra, 2020b; Li et al., 2021b) contains location coordinates of 200M places that are collected around the world from the Open Street Map (OpenStreetMap contributors, 2017), and the keys are the longitudes of these places.
• Lognormal (Ferragina & Vinciguerra, 2020b) is a synthetic dataset whose key intervals follow the lognormal distribution: ln(G_i) ∼ N(µ_lg, σ²_lg). To simulate the varied data characteristics among different localities, we generate 20M keys with 40 partitions by setting µ_lg = 1 and setting σ_lg to a random number within [0.1, 1] for each partition.

We normalize the positions of stored data into the range [0, 1], and thus the key-position distribution can be modeled as a Cumulative Distribution Function (CDF). We plot the CDFs and zoomed-in CDFs of the experimental datasets in Figure 7 and Figure 8 respectively, which intuitively illustrate the diversity of the adopted datasets. For example, the CDF visualization of the Map dataset shows that its distribution shifts considerably across different data localities, verifying the necessity of dynamically adapting and adjusting the learned index algorithms, as we consider in this paper.

For the N-MAE trade-off improvements and the actual querying efficiency improvements brought by the proposed framework, we illustrate more N-MAE trade-off curves in Figure 9 and querying time results in Figure 10. We also mark the 99th percentile (P99) latency as the right bar, which is a useful metric in industrial-scale practical systems. Recall that the N-MAE trade-off curve adequately reflects the index size and querying time: (1) the segment size in bytes and N differ only by a constant factor, e.g., the size of a segment can be 128 bits if it consists of two double-precision float parameters (slope and intercept); (2) the querying operation can be done in O(log(N) + log(|y − ŷ|)) as we mentioned in Section 3.1, thus a better N-MAE trade-off indicates a better querying efficiency.
From these figures, we can see that the dynamic ϵ versions of all the baseline methods achieve a better N-MAE trade-off and better querying efficiency, verifying the effectiveness and the wide applicability of the proposed framework. Regarding the P99 metric, we can see that the dynamic version achieves comparable or even better P99 results than the static version, showing that the proposed method not only improves the average lookup time, but also has good robustness. This is because our method can effectively adjust ϵ based on the expected ε and the data characteristics, keeping the {ϵ_i} fluctuating within a moderate range.

J.2 ABLATION STUDY

To examine the necessity and the effectiveness of the proposed framework, in Section 4.3, we compare the proposed framework with three dynamic ϵ variants for the FITing-Tree method. Here we demonstrate the AUNEC relative changes for the Radix-Spline method with the same three variants in Table 6 and similar conclusions can be drawn. 

J.3 THEORETICAL RESULTS VALIDATION

We study the impact of ϵ on SegErr_i for the MET algorithm in Theorem 1, where the derivations are based on the slope condition a_i = 1/µ. To confirm that the proposed framework also works well with other ϵ-bounded learned index methods, we analyze the learned slopes of other ϵ-bounded methods in Theorem 2. In summary, we prove that for a segment S_i: y = a_i x + b_i whose covered data is D_i, with expected key interval µ_i, the slope a_i concentrates on 1/µ_i within a relative deviation of 2ϵ/(E[Len(D_i)] − 1). Here we plot the learned slopes of the baseline learned index methods in Figure 11 (Map dataset) and in Figure 13 (IoT, Weblogs and Lognormal datasets). We can see that the learned slopes of other methods indeed center along the line a_i = 1/µ_i, in line with Theorem 2.

We further compare the theoretical bounds with the actual SegErr_i for all the adopted learned index methods. We show the results on the Lognormal dataset in Figure 12, and the results on two other datasets, Gamma and Uniform, in Figure 14, where the key intervals of the latter two datasets follow the gamma distribution and the uniform distribution respectively. As expected, the MET method has its actual SegErr_i within the derived bounds, verifying the correctness of Theorem 1. Besides, the other ϵ-bounded methods show the same trends as the MET method, providing evidence that these methods have the same mathematical forms as we derived, and thus the ϵ-learner also works well with them.

J.4 CV AS AN INDICATIVE QUANTITY

The coefficient of variation (CV) value, i.e., CV = σ/µ, is an important factor in our bounds and reflects the linearity degree of the data. As shown in our experiments, CV is effective in helping to dynamically adjust ϵ in our framework. Here we explore whether the CV value can be an indicative quantity that sheds light on what types of data will benefit from our dynamic adjustment. To be specific, we calculate the CV values of the experimental datasets and compare them with the trade-off improvements. The global CV values of IoT, Map, Lognormal, and Weblogs are 65.24, 11.12, 0.85, and 0.013 respectively, while their AUNEC improved by 20.71%, 6.47%, 21.89%, and 26.96% respectively. With the exception of IoT, the results show that the smaller the CV value is, the greater the trade-off improvement that dynamic ϵ brings. We find that IoT is a locally linear but globally fluctuating dataset. We then divide the data into 5000 segments and calculate their average CV values. The local CV values of IoT, Map, Lognormal, and Weblogs are 0.95, 2.18, 0.63, and 0.005 respectively, which is consistent with the improvement trends. Intuitively, when the local CV value is small, the local data is hard to fit with a few linear segments if we adopt an improper ϵ, and we need more fine-grained ϵ adjustment rather than the fixed setting. Thus we can expect more performance improvements in this case. Calculating the actual CV values of real-world datasets helps to validate our ϵ analysis based on the CV values, and provides further insight into the scenarios where the proposed method has strong potential to outperform existing methods.
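The global and local CV values discussed above can be computed along the following lines. This is a minimal sketch that assumes a sorted list of strictly increasing keys and equal-sized chunks of key gaps; the function names `cv` and `global_and_local_cv` are hypothetical, and the exact chunking used for the reported numbers may differ.

```python
import math

def cv(gaps):
    """Coefficient of variation (sigma / mu) of a list of key gaps."""
    mu = sum(gaps) / len(gaps)
    var = sum((g - mu) ** 2 for g in gaps) / len(gaps)
    return math.sqrt(var) / mu

def global_and_local_cv(keys, num_chunks):
    """Return (global CV, average CV over num_chunks contiguous chunks)."""
    gaps = [b - a for a, b in zip(keys, keys[1:])]
    size = max(2, len(gaps) // num_chunks)
    chunks = [gaps[i:i + size] for i in range(0, len(gaps), size)]
    chunks = [c for c in chunks if len(c) >= 2]   # CV needs at least two gaps
    local = sum(cv(c) for c in chunks) / len(chunks)
    return cv(gaps), local
```

A dataset whose gaps are constant within each locality but differ across localities gives a large global CV and a local CV near zero, which mirrors the "locally linear but globally fluctuating" behaviour described for IoT.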



For better readability, we summarize the notations in Appx. C.
We discuss how to extend existing works in more detail in Appx. D.
The lower and upper bounds of the searching ranges should be constrained to 0 and len(D_i) respectively. For brevity, we omit the corner cases when comparing these two searching strategies, as they both need to handle the out-of-bounds scenario.



Figure 1: The dynamic ϵ framework. We (1) transform ε into the proxy prediction error SegErr, then (2) sample a small look-ahead data D′ to estimate the data characteristics (µ, σ). (3) The ϵ-learner predicts a suitable ϵ_i accordingly, and (4) we learn a new segment S_i using A (e.g., PGM) with ϵ_i. (5) Once S_i triggers the violation of ϵ_i, the ϵ-learner is updated and enhanced with the rewarded ground-truth. Steps (2) to (5) repeat in an online manner to approximate the distribution of D.

Figure 2: The N -MAE trade-off curves for learned index methods.

Figure 3: Improvements in terms of querying times for learned index methods with dynamic ϵ.

Figure 4: Visualization of the learned index (partial) on IoT for FITing-Tree with fixed ϵ = 32 and dynamic version ( ε = 32).

Figure 6: Total throughput (ops/sec) and index size (KB) results of the PGM with static and dynamic ϵ on two different read-write workloads.

adopts a greedy spline interpolating algorithm to learn spline points,

Algorithm: Dynamic ϵ Adjustment with Pluggable ϵ Learner
Input: D: data to be indexed; A: learned index algorithm; ε: expected ϵ; ρ: length percentage for look-ahead data
Output: S: learned segments with varied ϵs
1: initial parameters w_{1,2,3} of the learned function: f(ϵ, µ, σ) = w_1 (µ/σ)^{w_2} ϵ^{w_3}
2: initial mean length of learned segments so far: Len(D_S) ← 404
3: S ← ∅, (μ/σ) ← lookahead(D, Len(D_S) · ρ)

Weblogs (Kraska et al., 2018; Galakatos et al., 2019; Ferragina & Vinciguerra, 2020b) contains about 715M log entries for the requests to a university web server, and the keys are log timestamps. IoT (Galakatos et al., 2019; Ferragina & Vinciguerra, 2020b) contains about 26M event entries from different IoT sensors in a building, and the keys are recording timestamps. Map (Kraska et al., 2018; Galakatos et al., 2019; Ding et al., 2020; Ferragina & Vinciguerra, 2020b; Li et al., 2021b) contains location coordinates of 200M places that are collected around the world from the Open Street Map (OpenStreetMap contributors, 2017), and the keys are the longitudes of these places. Lognormal (Ferragina & Vinciguerra, 2020b) is a synthetic dataset whose key intervals follow the lognormal distribution: ln(G_i) ∼ N(µ_lg, σ²_lg).

Figure 7: CDFs of adopted datasets.

Figure 8: Zoomed-in CDFs of adopted datasets.

Figure 9: The additional N -MAE trade-off curves for learned index methods.

Figure 10: Improvements in terms of querying times for learned index methods with dynamic ϵ.

Figure 11: Learned slopes.

Figure 13: Learned slopes on the IoT, Weblogs and Lognormal datasets.

Figure 14: Illustrations of the derived bounds on Gamma and Uniform datasets.

The AUNEC relative improvements for learned index methods with dynamic ϵ.

Building time increments in percentage for learned index methods with dynamic ϵ.

The AUNEC relative changes of dynamic ϵ variants compared to the proposed framework.

Zhou Zhang, Peiquan Jin, Xiao-Liang Wang, Yan-Qi Lv, Shouhong Wan, and Xike Xie. COLIN: A cache-conscious dynamic learned index with high read/write performance. J. Comput. Sci. Technol., 36(4):721-740, 2021.

Xuanhe Zhou, Chengliang Chai, Guoliang Li, and Ji Sun. Database meets artificial intelligence: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.

Table 4: The indexing performance comparison between ALEX and RadixSpline with the proposed dynamic ϵ framework.

Table 5: The adopted notations.

Table 6: The AUNEC relative changes of dynamic ϵ variants compared to the Radix-Spline method with the proposed framework.

APPENDIX FOR THE PAPER: LEARNED INDEX WITH DYNAMIC ϵ

Our appendix includes the following content:
• Sec. A: further descriptions and comparison of the related ϵ-based learned index methods and data layout optimization-based methods.
• Sec. B: the details of the binary search and exponential search, and the connections between prediction error and these specific searching strategies.
• Sec. C: the notations adopted in this paper.
• Sec. D: the discussion about how the proposed framework inherits the abilities of existing learned index methods.
• Sec. E: the full proof of Theorem 1.
• Sec. F: the analysis of the learned slopes of other ϵ-bounded methods.
• Sec. G: the summarized algorithm of the proposed method.
• Sec. H: the implementation details of experiments.
• Sec. I: the detailed descriptions and visualization of the adopted datasets.
• Sec. J: more experimental results including the overall index performance and ablation study on other datasets and methods (Sec. J.1 and Sec. J.2), and the theoretical validation (Sec. J.3). Besides, we explore an indicative quantity (the CV value) to provide further insight into the rationale of the proposed framework (Sec. J.4).

A MORE DETAILS ABOUT RELATED WORKS

A.1 ϵ-BOUNDED LINEAR APPROXIMATION METHODS

Besides the MET method mentioned in Sec. 2, we give more introduction to the related ϵ-bounded linear approximation methods. FITing-Tree (Galakatos et al., 2019) uses a greedy shrinking cone algorithm. PGM (Ferragina & Vinciguerra, 2020b) adopts another one-pass algorithm that achieves the optimal number of learned segments. Radix-Spline (Kipf et al., 2020) introduces a radix structure to organize learned segments.

Here we use a toy dataset to demonstrate the workflow of the ϵ-bounded linear approximation methods. Suppose we learn every segment in an online manner with a demonstrative method, which is simplified from MET and expresses the general idea of the class of ϵ-bounded linear approximation methods: when a new data point arrives, we connect it with the starting point of the segment to form the linear function of this segment, and check whether there is a data point whose prediction error is greater than ϵ. If so, the learning of this segment is terminated, and the current data point serves as the starting point of a new segment. The first two subfigures in Figure 5 show the learning process of the first segment, and the data point that terminates the process is taken as the starting point of the second segment. Subsequent subfigures show the learning process for the following segments in the same way. Existing works differ in how they determine the linear functions and the termination conditions, but they all follow a similar flow.

Figure 5: Workflow of ϵ-bounded linear approximation method.

However, existing methods constrain all learned segments with the same ϵ. All of these piece-wise segment based approaches attempt to improve performance by changing the way segments are learned or organized, but ignore the optimization potential of dynamically varying ϵ.
In this paper, we discuss the impact of ϵ in more depth and investigate how to enhance existing learned index methods from a new perspective: dynamic adjustment of ϵ accounting for the diversity of different

