LEARNING-BASED SUPPORT ESTIMATION IN SUBLINEAR TIME

Abstract

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log^2(1/ε) · n / log n), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly. We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state-of-the-art algorithm.

1. INTRODUCTION

Estimating the support size of a distribution from random samples is a fundamental problem with applications in many domains. In biology, it is used to estimate the number of distinct species from experiments (Fisher et al., 1943); in genomics, to estimate the number of distinct protein-encoding regions (Zou et al., 2016); in computer systems, to approximate the number of distinct blocks on a disk drive (Harnik et al., 2016); and so on. The problem also has applications in linguistics, query optimization in databases, and other fields. Because of its wide applicability, the problem has received plenty of attention in multiple fields¹, including statistics and theoretical computer science, starting with the seminal works of Good and Turing (Good, 1953) and Fisher et al. (1943). A more recent line of research pursued over the last decade (Raskhodnikova et al., 2009; Valiant & Valiant, 2011; 2013; Wu & Yang, 2019) focused on the following formulation of the problem: given access to independent samples from a distribution



¹ A partial bibliography contains over 900 references. It is available at https://courses.cit.cornell.edu/jab18/bibliography.html.

