PARTITIONED LEARNED BLOOM FILTER

Abstract

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a member of a set; they may return false positives. Recently, variants referred to as learned Bloom filters have been developed that can provide improved false positive rates by using a learned model for the represented set. However, previous methods for learned Bloom filters do not take full advantage of the learned model. Here we show how to frame the problem of optimal model utilization as an optimization problem, and using our framework we derive algorithms that can achieve near-optimal performance in many cases. Experimental results from both simulated and real-world datasets show significant performance improvements from our optimization approach over both the original learned Bloom filter constructions and previously proposed heuristic improvements.

1. INTRODUCTION

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a member of a set [Bloom (1970)]. A Bloom filter compresses a given set S into an array of bits. A Bloom filter may allow false positives, but will not give false negatives, which makes Bloom filters suitable for numerous memory-constrained applications in networks, databases, and other systems areas. Indeed, there are many thousands of papers describing applications of Bloom filters [Dayan et al. (2018), Dillinger & Manolios (2004), Broder & Mitzenmacher (2003)]. There is a trade-off between the false positive rate and the size of a Bloom filter: a smaller false positive rate requires a larger Bloom filter. For a given false positive rate, there are known theoretical lower bounds on the space used by a Bloom filter [Pagh et al. (2005)]. However, these lower bounds assume the Bloom filter could store any possible set. If the data set or the membership queries have specific structure, it may be possible to beat the lower bounds in practice [Mitzenmacher (2002), Bruck et al. (2006), Mitzenmacher et al. (2020)]. In particular, [Kraska et al. (2018)] and [Mitzenmacher (2018)] propose using machine learning models to reduce the space further, by using a learned model as a pre-filter for the membership queries. This allows one to beat the space lower bounds by leveraging the context-specific information captured by the learned model. Rae et al. (2019) propose a neural Bloom filter that learns to write to memory using a distributed write scheme and achieves compression gains over the classical Bloom filter. The key idea of learned Bloom filters is that in many practical settings, given a query input, the likelihood that the input is in the set S can be deduced from observable features that a machine learning model can capture.
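To make the size/false-positive trade-off concrete, the standard sizing formulas for a classical Bloom filter can be computed as follows. This is a minimal illustrative sketch; the function name and example numbers are our own, not from the paper.

```python
import math

def bloom_filter_size(n_keys: int, fpr: float) -> tuple:
    """Return (m, k): the optimal bit-array size m and number of hash
    functions k for a standard Bloom filter holding n_keys elements
    at target false positive rate fpr. Uses the classical formulas
    m = -n ln(fpr) / (ln 2)^2 and k = (m / n) ln 2."""
    m = math.ceil(-n_keys * math.log(fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n_keys) * math.log(2)))
    return m, k

# Example: 1 million keys at a 0.1% false positive rate needs
# about 1.44 * log2(1/0.001) ≈ 14.4 bits per key.
m, k = bloom_filter_size(1_000_000, 0.001)
```

Note that the bits-per-key cost depends only on the target false positive rate, not on the keys themselves; this is exactly the structure-oblivious bound that learned Bloom filters aim to beat.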
For example, a Bloom filter that represents a set of malicious URLs can benefit from a learned model that can distinguish malicious URLs from benign URLs. This model can be trained on URL features such as length of hostname, counts of special characters, etc. This approach is described in [Kraska et al. (2018)], which studies how standard index structures can be improved using machine learning models; we refer to their framework as the original learned Bloom filter. Given an input x and its features, the model outputs a score s(x) which is supposed to correlate with the likelihood of the input being in the set. Thus, the elements of the set, or keys, should have higher score values than non-keys. The model is used as a pre-filter: when the score s(x) of an input x is above a pre-determined threshold t, the input is directly classified as being in the set. For inputs where s(x) < t, a smaller backup Bloom filter built from only the keys with scores below the threshold (which are known) is used. This maintains the property that there are no false negatives. The design essentially uses the model to immediately answer for inputs with high scores, whereas the rest of the inputs are handled by the backup Bloom filter, as shown in Fig. 1(A). The threshold value t partitions the space of scores into two regions, with inputs processed differently depending on which region their score falls in. With a sufficiently accurate model, the size of the backup Bloom filter can be reduced significantly over the size of a standard Bloom filter while maintaining overall accuracy. [Kraska et al. (2018)] showed that, in some applications, even after taking the size of the model into account, the learned Bloom filter can be smaller than the standard Bloom filter for the same false positive rate. The original learned Bloom filter compares the model score against a single threshold, but the framework has several drawbacks.
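The single-threshold design just described can be sketched as follows. This is a toy illustration: the `score` function is a stand-in for a trained model, and a Python set plays the role of the backup Bloom filter (a real implementation would use an actual Bloom filter sized for the low-score keys).

```python
class LearnedBloomFilter:
    """Sketch of the original learned Bloom filter design: a learned
    score plus a single threshold t, with a backup filter holding
    only the keys the model would otherwise reject."""

    def __init__(self, keys, score, threshold):
        self.score = score
        self.threshold = threshold
        # Backup filter stores only keys with score below the threshold,
        # so no key is ever rejected (no false negatives).
        self.backup = {x for x in keys if score(x) < threshold}

    def contains(self, x):
        if self.score(x) >= self.threshold:
            return True          # model answers directly (may be a false positive)
        return x in self.backup  # low-score keys are caught by the backup filter

# Toy "model": even numbers look like keys.
score = lambda x: 1.0 if x % 2 == 0 else 0.1
lbf = LearnedBloomFilter(keys=[2, 4, 5], score=score, threshold=0.5)
# lbf.contains(5) is True via the backup filter; lbf.contains(6) is a
# false positive, since the model scores it above the threshold.
```

With a set as the backup structure, the only false positives here come from the model region; with a real backup Bloom filter, the region below the threshold would contribute false positives as well.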
Choosing the right threshold: The choice of threshold value for the learned Bloom filter is critical, but the original design uses heuristics to determine the threshold value.

Using more partitions: Comparing the score value against only a single threshold wastes information provided by the learned model. For instance, two elements x_1, x_2 with s(x_1) >> s(x_2) > t are treated the same way, but the odds of x_1 being a key are much higher than those of x_2. Intuitively, we should be able to do better by partitioning the score space into more than two regions.

Optimal Bloom filters for each region: Elements with scores above the threshold are directly accepted as keys. A more general design would provide backup Bloom filters in both regions and choose the Bloom filter false positive rate of each region so as to optimize the space/false-positive trade-off as desired. The original setup can be interpreted as using, above the threshold, a Bloom filter of size 0 with a false positive rate of 1. This may not be the optimal choice; moreover, as we show, using different Bloom filters for each region (as shown in Fig. 1(C)) allows further gains as we increase the number of partitions.

Follow-up work by [Mitzenmacher (2018)] and [Dai & Shrivastava (2019)] improves on the original design but addresses only a subset of these drawbacks. In particular, [Mitzenmacher (2018)] proposes using Bloom filters for both regions and provides a method to find the optimal false positive rates for each Bloom filter, but considers only two regions and does not address how to find the optimal threshold value. [Dai & Shrivastava (2019)] propose using multiple thresholds to divide the space of scores into multiple regions, with a different backup Bloom filter for each score region. The false positive rates for each of the backup Bloom filters and the threshold values are chosen using heuristics.
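The multi-region idea above, routing each query to a per-region backup filter by its score, can be sketched as follows. Again this is only an illustrative skeleton: Python sets stand in for the per-region backup Bloom filters, whose individual false positive rates would be tuned by the optimization described in this paper, and the threshold values here are arbitrary examples.

```python
import bisect

class PartitionedLBF:
    """Sketch of a partitioned learned Bloom filter: the score space
    is split into len(thresholds) + 1 regions by the sorted interior
    boundaries in `thresholds`, and each region gets its own backup
    filter (sets stand in for real Bloom filters here)."""

    def __init__(self, keys, score, thresholds):
        self.score = score
        self.thresholds = thresholds  # sorted interior region boundaries
        self.filters = [set() for _ in range(len(thresholds) + 1)]
        for x in keys:
            self.filters[self._region(x)].add(x)

    def _region(self, x):
        # Index of the score region that x's score falls into.
        return bisect.bisect_right(self.thresholds, self.score(x))

    def contains(self, x):
        # Query only the backup filter for x's own region.
        return x in self.filters[self._region(x)]

# Toy score in [0, 1] and two interior thresholds => three regions.
plbf = PartitionedLBF(keys=[1, 2, 3],
                      score=lambda x: x / 10,
                      thresholds=[0.15, 0.25])
```

In a real construction, each region's Bloom filter would be sized for its own optimized false positive rate; a region given rate 1 simply needs no filter at all, recovering the original single-threshold design as a special case.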
Empirically, we found that these heuristics can perform worse than [Mitzenmacher (2018)] in some scenarios. A general design that resolves all these drawbacks would, given a target false positive rate and the learned model, partition the score space into multiple regions with a separate backup Bloom filter for each region, and find the optimal threshold values and false positive rates, with the goal of minimizing the memory usage while achieving the desired false positive rate, as shown in Fig. 1(C). In this work, we show how to frame this problem as an optimization problem, and show that our resulting solution significantly outperforms the heuristics used in previous works. Additionally, we show that our maximum space saving (the space saved by using our approach instead of a Bloom filter) is linearly proportional to the KL divergence of the key and non-key score distributions determined by the partitions. We present a dynamic programming algorithm to find the optimal parameters (up to the discretization used for the dynamic programming) and demonstrate performance improvements over a synthetic dataset and two real-world datasets: URLs and EMBER. We also show that the performance of the learned Bloom filter improves with an increasing number of partitions, and that in practice a small number of regions (≈ 4 to 6) suffices to get very good performance. We refer to our approach as a partitioned learned Bloom filter (PLBF). Experimental results from both simulated and real-world datasets show significant performance improvements. We show that to achieve a false positive rate of 0.001, [Mitzenmacher (2018)] uses

