SCALABLE SUBSET SAMPLING WITH NEURAL CONDITIONAL POISSON NETWORKS

Abstract

A number of problems in learning can be formulated in terms of the basic primitive of sampling k elements out of a universe of n elements. This subset sampling operation cannot be included directly in differentiable models, and approximations are essential. Current methods take an order sampling approach to sampling subsets and depend on differentiable approximations of the Top-k operator for selecting the largest k elements from a set. We present a simple alternative method for sampling subsets based on conditional Poisson sampling. Unlike order sampling approaches, the complexity of the proposed method is independent of the subset size, which makes the method scalable to large subset sizes. We adapt the procedure to make it efficient and amenable to discrete gradient approximations for use in differentiable models. Furthermore, the method allows the subset size parameter k to be differentiable. We validate our approach extensively on image and text model explanation, image subsampling, and stochastic k-nearest neighbor tasks, outperforming existing methods in accuracy, efficiency, and scalability.

1. INTRODUCTION

The fundamental combinatorial operation of selecting subsets of elements from a given universe is increasingly being incorporated into differentiable neural models due to its wide range of applicability. Example applications include model explanation (Chen et al., 2018), sequence modeling (Kool et al., 2019), point cloud modeling (Yang et al., 2019), and nearest neighbor networks (Grover et al., 2018). Current neural network approaches for sampling subsets generally fall into the class of order sampling methods. In the order sampling scheme, each element in the universe is assigned an independent ranking random variable; to obtain a subset sample of size k, the k largest (or smallest) elements are chosen. Thereby, the distribution of the ranking variables induces a probability distribution over the possible subsets. However, the operation of choosing the largest k elements (Top-k) is not differentiable, since it is a discrete operation, so the Top-k procedure cannot be used directly in gradient-based learning models. This has led to a number of proposals for relaxed, differentiable versions of the Top-k operator (Goyal et al., 2018; Pietruszka et al., 2021; Plötz & Roth, 2018). Building on Top-k approaches, several methods for sampling subsets as k-hot vectors have appeared in the literature (Paulus et al., 2020; Xie & Ermon, 2019). In this paper, we explore Poisson sampling (Tillé, 2006) and conditional Poisson sampling (Hájek & Dupač, 1981) as an alternative to order sampling for subsets. With Poisson sampling, each element of the set is independently drawn to be selected for the subset or not. As these independent trials cannot guarantee a fixed subset size, conditional Poisson sampling conditions the Poisson sampling procedure to return subsets of exactly k elements. In practice, the conditioning amounts to repeating the Poisson sampling procedure until a subset of size k is obtained.
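The rejection view of conditional Poisson sampling described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's neural procedure; the function names are ours.

```python
import numpy as np

def poisson_sample(p, k, rng):
    """One draw of Poisson sampling: include element i independently
    with probability k * p[i] (sizes p sum to 1, and k * max(p) <= 1)."""
    return (rng.random(p.shape) <= k * p).astype(int)

def conditional_poisson_sample(p, k, rng):
    """Conditional Poisson sampling by rejection: repeat the independent
    trials until the sampled subset has exactly k elements."""
    while True:
        x = poisson_sample(p, k, rng)
        if x.sum() == k:
            return x

rng = np.random.default_rng(0)
p = np.ones(10) / 10                       # uniform sizes over a universe of n = 10
x = conditional_poisson_sample(p, k=3, rng=rng)
assert x.sum() == 3                        # indicator vector with exactly k ones
```

The rejection loop makes the cost of exact-size sampling visible: each retry discards a full pass over the universe, which motivates relaxing the exact-k constraint later in the paper.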
The general (conditional) Poisson sampling approach has a number of features that make it an attractive alternative to Top-k-based order sampling methods. Firstly, elements are drawn independently in Poisson sampling, which makes the procedure very efficient for sampling subsets with large values of k. By contrast, current Top-k methods (Goyal et al., 2018; Plötz & Roth, 2018) often have an inner loop depending on k, which makes them expensive in both time and memory when sampling large subsets. Furthermore, computations in modern neural network models are vectorized. This makes sampling different subset sizes k for different elements in a batch difficult for current Top-k procedures, since the number of sampling iterations needed to obtain the Top-k elements depends on k. With Poisson sampling, it is trivial to sample different subset sizes for batched inputs, making it ideally suited to vectorized computation. Finally, with Top-k, the subset size k itself is not differentiable. With Poisson sampling, k appears as a scaling parameter for the probabilities of the individual elements in the universe; therefore, the subset size parameter k can easily be incorporated into differentiable computations once a differentiable sampling procedure is available. Despite these advantages, there are two difficulties with Poisson sampling. The first is that vanilla Poisson sampling can lead to large variance in the sampled subset size. This can be resolved with conditional Poisson sampling to obtain exact-size samples, but only at the cost of high computational complexity. The second (and main) difficulty is that neither Poisson sampling nor its conditional variant is differentiable, so neither can be directly included in differentiable models. In this paper, in the context of differentiable subset sampling with neural networks, we propose neural conditional Poisson subset sampling.
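The claim that per-instance subset sizes vectorize trivially can be illustrated with a single batched threshold comparison; the batch size, universe size, and per-example k values below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n = 4, 8
p = np.full((batch, n), 1.0 / n)        # per-example size vectors, each row sums to 1
k = np.array([1.0, 2.0, 3.0, 4.0])      # a different expected subset size per example

# One vectorized pass over the whole batch:
# element (b, i) is included iff u[b, i] <= k[b] * p[b, i].
u = rng.random((batch, n))
x = (u <= k[:, None] * p).astype(int)   # indicator matrix; row b has ~k[b] ones

print(x.sum(axis=1))                    # per-example sizes; equal to k only in expectation
```

No loop over k appears anywhere, which is precisely what makes the approach cheap for large subsets compared with iterative Top-k relaxations.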
We note that we often do not need subsets of exactly k elements, as conditional Poisson sampling would provide, and sampling subsets of size k in expectation is enough. With neural conditional Poisson subset sampling, we relax the constraint of sampling exactly k elements, allowing us to trade off accuracy in the subset size for efficiency in sampling large subsets. Compared to Top-k approaches for sampling subsets (Xie & Ermon, 2019), neural conditional Poisson subset sampling allows efficient sampling of large subsets, easy choice of per-instance subset sizes, and differentiable subset sizes, at a small loss in subset size accuracy when an exact number of elements in the sampled subsets is not a necessity. Secondly, we adapt the sampling procedure so that gradient approximations for discrete variables are applicable. The resulting method is scalable and can be used to sample large subsets even from image-sized domains at full resolution, a task that is to date infeasible for existing subset sampling methods.
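The trade-off being relaxed here, exact size versus size in expectation, is easy to quantify with a small simulation; the numbers below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, trials = 100, 10, 2000
p = np.ones(n) / n                       # uniform sizes over the universe

# Draw many vanilla Poisson samples and record the subset sizes.
sizes = (rng.random((trials, n)) <= k * p).sum(axis=1)

# The size equals k only in expectation; for independent inclusions its
# variance is sum_i k*p_i*(1 - k*p_i), here 100 * 0.1 * 0.9 = 9.
print(sizes.mean())                      # close to k = 10
print(sizes.var())                       # close to 9
```

Conditioning (rejection) would pin every size at exactly k, at the cost of repeated passes; accepting this spread is what buys the efficiency described above.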

2. PRELIMINARIES

Let U = {1, 2, . . . , n} denote a universe consisting of n elements. Each element i ∈ U is assigned a "size" p_i ∈ (0, 1); we assume the sizes are normalized to the unit interval. Let x denote a subset of the elements in U, represented as an indicator vector of size n, x = (x_1, . . . , x_n), where x_i ∈ {0, 1} and x_i = 1 if the i-th element is included in the subset. The sample size is the number of elements in the chosen subset, i.e., ∑_i x_i. In this paper, we are concerned with sampling subsets of a given size k from a universe of n elements. A sampling design (Tillé, 2006), S, is a way to assign a probability to each subset of the universe U, i.e., S : P(U) → [0, 1], where P(U) is the power set of U. Intuitively, a sampling procedure induces a sampling design by assigning to each subset the probability with which that subset is chosen. Conversely, a number of sampling procedures may correspond to the same sampling design. Occasionally, a sampling procedure is also referred to as a sampling design. Inclusion Probability. Important parameters of a sampling design are the inclusion probabilities. The first-order inclusion probability of an element i is the marginal probability, over the space of samples, that i is included in the sample. If I_i is an indicator variable with I_i = 1 when i is included, the i-th inclusion probability is π_i := E[I_i]. Poisson Sampling. Poisson sampling (Tillé, 2006) is a probability-proportional-to-size sampling design for sampling without replacement. This means that each element i is included in the sample with probability proportional to its size p_i, where we assume that ∑_i p_i = 1. With Poisson sampling, the sample size is a random variable with expected value k. Given independent uniform random variables u_i ∼ U(0, 1), an element i is included in the sample if u_i ≤ k p_i.
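Under the threshold rule u_i ≤ k p_i, the first-order inclusion probabilities of Poisson sampling are π_i = k p_i, which a Monte Carlo estimate confirms; the specific size vector below is an arbitrary example of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
k, trials = 2, 20000
p = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # sizes, sum to 1

# Poisson sampling as defined above: include element i whenever u_i <= k * p_i.
I = (rng.random((trials, p.size)) <= k * p).astype(int)

# Monte Carlo estimate of the first-order inclusion probabilities pi_i = E[I_i].
pi_hat = I.mean(axis=0)
print(np.round(pi_hat, 2))                     # close to k * p = [0.2, 0.3, 0.4, 0.5, 0.6]
```

The expected sample size is then ∑_i π_i = k ∑_i p_i = k, matching the statement that Poisson sampling yields subsets of size k in expectation.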

