LEARNING-BASED SUPPORT ESTIMATION IN SUBLINEAR TIME

Abstract

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log²(1/ε) · n/log n), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε) · n^(1-Θ(1/log(1/ε))). We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state-of-the-art algorithm.

1. INTRODUCTION

Estimating the support size of a distribution from random samples is a fundamental problem with applications in many domains. In biology, it is used to estimate the number of distinct species from experiments (Fisher et al., 1943); in genomics, to estimate the number of distinct protein-encoding regions (Zou et al., 2016); in computer systems, to approximate the number of distinct blocks on a disk drive (Harnik et al., 2016), etc. The problem also has applications in linguistics, query optimization in databases, and other fields. Because of its wide applicability, the problem has received plenty of attention in multiple fields, including statistics and theoretical computer science, starting with the seminal works of Good and Turing (Good, 1953) and Fisher et al. (1943). A more recent line of research pursued over the last decade (Raskhodnikova et al., 2009; Valiant & Valiant, 2011; 2013; Wu & Yang, 2019) focused on the following formulation of the problem: given access to independent samples from a distribution P over a discrete domain {0, . . . , n − 1} whose minimum non-zero mass¹ is at least 1/n, estimate the support size of P up to ±εn. The state-of-the-art estimator, due to Valiant & Valiant (2011); Wu & Yang (2019), solves this problem using only O(n/log n) samples (for constant ε). Both papers also show that this bound is tight. A more straightforward linear-time algorithm exists, which reports the number of distinct elements seen in a sample of size N = O(n log(1/ε)) (which is O(n) for constant ε), without accounting for the unseen items. This algorithm succeeds because each element i with non-zero mass (and thus mass at least 1/n) appears in the sample with probability at least 1 − (1 − 1/n)^N > 1 − ε, so in expectation, at most ε · n elements with non-zero mass will not appear in the sample.
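The straightforward linear-time estimator can be sketched in a few lines. This is a minimal illustration only; the helper name `naive_support_estimate` is ours, and the data set is represented as a Python list whose empirical distribution has all non-zero masses at least 1/n:

```python
import math
import random

def naive_support_estimate(population, epsilon, seed=0):
    """Draw N = ceil(n * ln(1/epsilon)) samples with replacement and
    report the number of distinct elements seen, ignoring unseen items.
    Each element with mass >= 1/n appears in the sample with probability
    at least 1 - (1 - 1/n)^N > 1 - epsilon, so in expectation at most
    epsilon * n elements are missed."""
    rng = random.Random(seed)
    n = len(population)  # data set size; every non-zero mass is >= 1/n
    N = math.ceil(n * math.log(1 / epsilon))
    return len({rng.choice(population) for _ in range(N)})
```

For instance, with n = 1000 distinct elements and ε = 0.1, the sample size is N = ⌈1000 · ln 10⌉ = 2303, and in expectation roughly ε · n = 100 elements go unseen.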
Thus, in general, the number of samples required by the best possible algorithm (i.e., n/ log n) is only logarithmically smaller than the complexity of the straightforward linear-time algorithm. A natural approach to improve over this bound is to leverage the fact that in many applications, the input distribution is not entirely unknown. Indeed, one can often obtain rough approximations of the element frequencies by analyzing different but related distributions. For example, in genomics, frequency estimates can be obtained from the frequencies of genome regions of different species; in linguistics they can be inferred from the statistical properties of the language (e.g., long words are rare), or from a corpus of writings of a different but related author, etc. More generally, such estimates can be learned using modern machine learning techniques, given the true element frequencies in related data sets. The question then becomes whether one can utilize such predictors for use in support size estimation procedures in order to improve the estimation accuracy.

Our results

In this paper we initiate the study of such "learning-based" methods for support size estimation. Our contributions are both theoretical and empirical. On the theory side, we show that given a "good enough" predictor of the distribution P, one can solve the problem using far fewer than n/log n samples. Specifically, suppose that in the input distribution P the probability of element i is p_i, and that we have access to a predictor Π(i) such that Π(i) ≤ p_i ≤ b · Π(i) for some constant approximation factor b ≥ 1.² Then we give an algorithm that estimates the support size up to ±εn using only log(1/ε) · n^(1-Θ(1/log(1/ε))) samples, assuming the approximation factor b is a constant (see Theorem 1 for a more detailed bound). This improves over the bound of Wu & Yang (2019) for any fixed values of the accuracy parameter ε and the predictor quality factor b. Furthermore, we show that this bound is almost tight. Our algorithm is presented in Algorithm 1. At a high level, it partitions the range of probability values into geometrically increasing intervals. We then use the predictor to assign the elements observed in the sample to these intervals, and produce a Wu-Yang-like estimate within each interval.
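The interval-assignment step can be illustrated as follows. This is a sketch of the bucketing only, with the hypothetical helper name `partition_by_predictor`; it is not the paper's Algorithm 1, which additionally runs a Chebyshev-polynomial estimator within each interval:

```python
import math
from collections import defaultdict

def partition_by_predictor(sample, predictor, n, b=2.0):
    """Assign each distinct sampled element x to a geometrically
    increasing probability interval [b^j / n, b^(j+1) / n), chosen
    from its predicted frequency Pi(x). Under the assumption
    Pi(x) <= p_x <= b * Pi(x), the true probability p_x falls in
    the chosen interval or the one just above it."""
    buckets = defaultdict(set)
    for x in set(sample):
        p_hat = predictor(x)  # predicted probability, assumed >= 1/n
        j = math.floor(math.log(max(p_hat * n, 1.0), b))
        buckets[j].add(x)
    return dict(buckets)
```

With the uniform predictor Π(x) = 1/n, for example, every sampled element lands in the lowest interval (bucket 0).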



¹ This constraint is naturally satisfied, e.g., if the distribution P is an empirical distribution over a data set of n items. In fact, in this case all probabilities are multiples of 1/n, so the support size is equal to the number of distinct elements in the data set.
² Our results hold without change if we modify the assumption to r · Π(i) ≤ p_i ≤ r · b · Π(i), for any r > 0. We use r = 1 for simplicity.



Specifically, our estimator is based on Chebyshev polynomials (as in Valiant & Valiant (2011); Wu & Yang (2019)), but the finer partitioning into intervals allows us to use polynomials with different, carefully chosen parameters. This leads to significantly improved sample complexity if the predictor is sufficiently accurate.

On the empirical side, we evaluate the proposed algorithms on a collection of real and synthetic data sets. For the real data sets (network traffic data and AOL query log data) we use the neural-network based predictors from Hsu et al. (2019). Although those predictors do not always approximate the true distribution probabilities up to a small factor, our experiments nevertheless demonstrate that the new algorithm offers substantial improvements (up to 3x reduction in relative error) in estimation accuracy compared to the state-of-the-art algorithm of Wu & Yang (2019).

1.1 RELATED WORK

Estimating support size. As described in the introduction, the problem has been studied extensively in statistics and theoretical computer science. The best known algorithm, due to Wu & Yang

