APPROXIMATING ANY FUNCTION VIA CORESET FOR RADIAL BASIS FUNCTIONS: TOWARDS PROVABLE DATA SUBSET SELECTION FOR EFFICIENT NEURAL NETWORKS TRAINING

Abstract

Radial basis function neural networks (RBFNNs) are well known for their capability to approximate any continuous function on a closed bounded set with arbitrary precision given enough hidden neurons. A coreset is a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, etc.). In this paper, we suggest the first coreset construction algorithm for RBFNNs, i.e., a small weighted subset that approximates the loss of the input data on any radial basis function network, and thus approximates any function defined by an RBFNN on the large input data. This is done by constructing coresets for radial basis and Laplacian loss functions. We use our coreset to suggest a provable data subset selection algorithm for training deep neural networks: since our coreset approximates every such function, it also approximates the gradient of each weight in a neural network, as each gradient is itself a function of the input. Experimental results on function approximation and dataset subset selection on popular network architectures and datasets are presented, demonstrating the efficacy and accuracy of our coreset construction.

1. INTRODUCTION

Radial basis function neural networks (RBFNNs) are artificial neural networks that generally have three layers: an input layer, a hidden layer with a radial basis function (RBF) as an activation function, and a linear output layer. In this paper, the input layer receives a d-dimensional vector x ∈ R^d of real numbers. The hidden layer then consists of various nodes representing RBFs, which compute ρ(∥x − c_i∥_2) := exp(−∥x − c_i∥_2²), where c_i ∈ R^d is the center vector of neuron i across, say, N neurons in the hidden layer. The linear output layer then computes Σ_{i=1}^N α_i ρ(∥x − c_i∥_2), where α_i is the weight of neuron i in the linear output neuron. Therefore, RBFNNs are feed-forward neural networks, because the edges between the nodes do not form a cycle, and they enjoy advantages such as simplicity of analysis, faster training time, and interpretability, compared to alternatives such as convolutional neural networks (CNNs) and even multi-layer perceptrons (MLPs) (Padmavati, 2011).

Function approximation via RBFNNs. RBFNNs are universal approximators in the sense that an RBFNN with a sufficient number of hidden neurons (large N) can approximate any continuous function on a closed, bounded subset of R^d with arbitrary precision (Park & Sandberg, 1991), i.e., given a sufficiently large input set P of n points in R^d and its corresponding label function y : P → R, an RBFNN can be trained to approximate the function y. Therefore, RBFNNs are commonly used across a wide range of applications, such as function approximation (Park & Sandberg, 1991; 1993; Lu et al., 1997), time series prediction (Whitehead & Choate, 1996; Leung et al., 2001; Harpham & Dawson, 2006), classification (Leonard & Kramer, 1991; Wuxing et al., 2004; Babu & Suresh, 2012), and system control (Yu et al., 2011; Liu, 2013), due to their faster learning speed.
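To make the forward pass above concrete, here is a minimal NumPy sketch of a Gaussian RBF network output; the function name and shapes are our own illustration, not code from the paper.

```python
import numpy as np

def rbfnn_forward(x, centers, alphas):
    """Output of a Gaussian RBF network: phi(x) = sum_i alpha_i * exp(-||x - c_i||_2^2).

    x:       (d,) input vector
    centers: (N, d) hidden-neuron center vectors c_i
    alphas:  (N,) output-layer weights alpha_i
    """
    sq_dists = np.sum((centers - x) ** 2, axis=1)  # ||x - c_i||_2^2, one per hidden neuron
    return float(alphas @ np.exp(-sq_dists))       # linear output layer
```

For example, with a single center equal to the input itself, the hidden activation is exp(0) = 1, so the output is exactly that neuron's weight α_1.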
For a given size of RBFNN (number of neurons in the hidden layer) and an input set, the aim of this paper is to compute a small weighted subset that approximates the loss of the input data on any radial basis function neural network of this size, and thus approximates any function defined (approximated) by such an RBFNN on the large input data. This small weighted subset is called a coreset.

Coresets. Usually, in machine/deep learning, we are given an input set P ⊆ R^d of n points, a corresponding weight function w : P → R, a set of queries X (a set of candidate solutions for the involved optimization problem), and a loss function f : P × X → [0, ∞). The tuple (P, w, X, f) is called a query space, and it defines the optimization problem at hand, where usually the goal is to find x* ∈ arg min_{x∈X} Σ_{p∈P} w(p)f(p, x). Given a query space (P, w, X, f), a coreset for it is a small weighted subset of the input P that can provably approximate the cost of every query x ∈ X on P (Feldman, 2020); see Definition 1. In particular, a coreset for an RBFNN can approximate the cost of an RBFNN on the original training data for every set of centers and weights that define the RBFNN (see Section 4). Hence, the coreset also approximates the centers and weights that form the optimal solution of the RBFNN (the solution that approximates the desired function). Thus, a coreset for an RBFNN would facilitate training for function approximation without reading the full training data; more generally, a strong coreset for an RBFNN with enough hidden neurons would give a strong coreset for any function that can be approximated to some precision by that RBFNN. To this end, in this paper, we aim to provide a coreset for RBFNNs, and thus to provably approximate (provide a coreset for) any function that can be approximated by a given RBFNN.
Furthermore, we can use this small weighted subset (coreset) to suggest a provable data subset selection algorithm for training deep neural networks efficiently (on the small subset): since our coreset approximates every function that can be approximated by an RBFNN of this size, it should approximate the gradient of each weight in a neural network (if that gradient can be approximated by the RBFNN).

Training neural networks on data subsets. Although deep learning has become widely successful with the increasing availability of data (Krizhevsky et al., 2017; Devlin et al., 2019), modern deep learning systems have correspondingly increased in their computational resources, resulting in significantly larger training times, financial costs (Sharir et al., 2020), energy costs (Strubell et al., 2019), and carbon footprints (Strubell et al., 2019; Schwartz et al., 2020). Data subset selection (coresets) allows for efficient learning at several levels (Wei et al., 2014; Kaushal et al., 2019; Coleman et al., 2019; Har-Peled & Mazumdar, 2004; Clarkson, 2010). By employing a significantly smaller subset of the big dataset, (i) we enable learning in relatively low-resource computing settings without requiring a huge number of GPU and CPU servers, (ii) we may greatly optimize the end-to-end turnaround time, which frequently necessitates many training runs for hyper-parameter tweaking, and (iii) because a large number of deep learning trials must be done in practice, we allow for considerable reductions in deep learning energy usage and CO2 emissions (Strubell et al., 2019). Several efforts have recently been made to improve the efficiency of machine learning models using data subset selection (Mirzasoleiman et al., 2020a; Killamsetty et al., 2021b; a).
However, existing techniques either (i) employ proxy functions to choose data points, (ii) are specialized to specific machine learning models, (iii) use approximations of parameters such as gradient error or generalization error, (iv) lack provable guarantees on the approximation error, or (v) require an inefficient gradient computation over the whole data. Most importantly, all of these methods are model/network dependent; thus, computing the desired subset of the data after several training epochs (for the same network) takes a lot of time and must be repeated each time the network changes. To this end, in this paper, we aim to suggest a provable and efficient model-independent subset selection algorithm for training neural networks. This allows us to compute a subset of the training data that is guaranteed to be a coreset for training several neural network architectures/models.

1.1. OUR CONTRIBUTIONS

In this paper, we suggest a coreset that approximates any function that can be represented by an RBFNN architecture. Specifically: (i) We provide a coreset for the RBF and Laplacian cost functions; see Algorithm 1 and Section 3.1.2. (ii) We employ (i) to generate a coreset for any RBFNN model, in turn approximating any function that can be represented by the RBFNN; see Figure 1 for an illustration and Section 4 for more details. (iii) We then exploit the properties of RBFNNs to approximate the gradients of any deep neural network (DNN), leading towards provable subset selection for learning/training DNNs. We also show the advantages of using our coreset over previous subset selection techniques; see Section 5 and Section 6. (iv) Finally, we provide an open-source code implementation of our algorithm for reproducing our results and for future research.

1.2. RELATED WORK

A long line of active work has studied efficient coreset constructions for various problems in computational geometry and machine learning, such as k-means and k-median clustering (Har-Peled & Mazumdar, 2004; Chen, 2009; Braverman et al., 2016; Huang & Vishnoi, 2020; Jubran et al., 2020; Cohen-Addad et al., 2022), regression (Dasgupta et al., 2008; Chhaya et al., 2020; Tolochinsky et al., 2022; Meyer et al., 2022), low-rank approximation (Cohen et al., 2017; Braverman et al., 2020; Maalouf et al., 2020; 2021), volume maximization (Indyk et al., 2020; Mahabadi et al., 2020; Woodruff & Yasuda, 2022), projective clustering (Feldman et al., 2020; Tukan et al., 2022b), support vector machines (SVMs) (Clarkson, 2010; Tukan et al., 2021), Bayesian inference (Campbell & Broderick, 2018), sine wave fitting (Maalouf et al., 2022), and decision trees (Jubran et al., 2021). (Baykal et al., 2022) suggested coreset-based algorithms for compressing the parameters of a trained fully-connected neural network by using sensitivity sampling on the weights of neurons after training, though without pruning full neurons. (Mussay et al., 2020; Liebenwein et al., 2019; Tukan et al., 2022a) sidestepped this issue by identifying the neurons that can be compressed regardless of their weights, due to the choice of the activation functions, thereby achieving coreset-based algorithms for neural pruning. These approaches use coresets to achieve an orthogonal goal to data subset selection in the context of deep learning: they greatly reduce the number of neurons in the network, while we greatly reduce the number of samples in the dataset that need to be read by the neural network. Correspondingly, we reduce the effective size of the data that needs to be stored or even measured prior to the training stage.
Moreover, we remark that even if the number of inputs to the input layer were greatly reduced by these neural compression approaches, the union of the inputs can still consist of the entire input dataset, and so these approaches generally cannot guarantee any form of data distillation. Toward the goal of data subset selection, (Mirzasoleiman et al., 2020a; b) introduced algorithms for selecting representative subsets of the training data to accurately estimate the full gradient for tasks in both deep learning and classical machine learning models such as logistic regression, and these approaches were subsequently refined by (Killamsetty et al., 2021a; b). Data distillation has also received a lot of attention in image classification (Bohdal et al., 2020; Nguyen et al., 2021; Dosovitskiy et al., 2021), natural language processing (Devlin et al., 2019; Brown et al., 2020), and federated learning (Ozkara et al., 2021; Zhu et al., 2021).

2. PRELIMINARIES

For an integer n > 0, we use [n] to denote the set {1, 2, . . . , n}. A weighted set of points is a pair (P, w), where P ⊆ R^d is a set of points and w : P → [0, ∞) is a weight function. We now formally provide the notion of an ε-coreset for the RBF loss; this will later be extended to a coreset for RBFNNs.

Definition 1 (RBF ε-coreset). Let (P, w) be a weighted set of n points in R^d, let X ⊆ R^d be a set of queries, and let ε ∈ (0, 1). For every x ∈ X and p ∈ P, let f(p, x) := exp(−∥p − x∥_2²) denote the RBF loss function between p and x. An ε-coreset for (P, w) with respect to f is a pair (S, v), where S ⊆ P and v : S → (0, ∞) is a weight function, such that for every x ∈ X,

|1 − (Σ_{q∈S} v(q)f(q, x)) / (Σ_{p∈P} w(p)f(p, x))| ≤ ε.

We say the RBF coreset is strong if it guarantees correctness for all x ∈ X; otherwise, we say the coreset is weak, i.e., it provides guarantees only for x in some subset of X.

Sensitivity sampling. To compute our RBF ε-coreset, we utilize the sensitivity sampling framework (Braverman et al., 2016). In short, the sensitivity of a point p ∈ P corresponds to the "importance" of this point with respect to the other points and the problem at hand. In our context (with respect to the RBF loss), the sensitivity is defined as s(p) = sup_{x∈X} w(p)f(p, x) / Σ_{q∈P} w(q)f(q, x), where the sup is over every x such that the denominator is nonzero. Once we bound the sensitivities for every p ∈ P, we can sample points from P according to their corresponding sensitivity bounds, and re-weight the sampled points to obtain an RBF ε-coreset as in Definition 1. The size of the sample (coreset) is proportional to the sum of these bounds: the tighter (smaller) these bounds, the smaller the coreset size. For formal details, we refer the reader to Section A in the appendix.

Sensitivity bounding. We now present our main tool for bounding the sensitivity of each input point with respect to the RBF and Laplacian loss functions. Definition 2 (Special case of Definition 4 (Tukan et al., 2020)).
Let (P, w, R^d, f) be a query space as in Definition 9, where for every p ∈ P and x ∈ R^d, f(p, x) = |p^T x|. Let D ∈ [0, ∞)^{d×d} be a diagonal matrix of full rank and let V ∈ R^{d×d} be an orthogonal matrix, such that for every x ∈ R^d, ∥DV^T x∥_2 ≤ Σ_{p∈P} w(p)|p^T x| ≤ √d ∥DV^T x∥_2. Define U : P → R^d such that U(p) = p(DV^T)^{−1} for every p ∈ P. The tuple (U, D, V) is the ∥·∥_1-SVD of P.

Using the above tool, the sensitivity w.r.t. the RBF loss function can be bounded as follows.

Lemma 3 (Special case of Lemma 35, (Tukan et al., 2020)). Let (P, w, R^d, f) be a query space as in Definition 9, where for every p ∈ P and x ∈ R^d, f(p, x) = |p^T x|. Let (U, D, V) be the ∥·∥_1-SVD of (P, w) with respect to |·| (see Definition 2). Then claims (i)–(ii) hold as follows: (i) for every p ∈ P, the sensitivity of p with respect to the query space (P, w, R^d, |·|) is bounded by s(p) ≤ ∥U(p)∥_1, and (ii) the total sensitivity is bounded by Σ_{p∈P} s(p) ≤ d^{1.5}.
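The sensitivity sampling framework described above reduces to a simple sample-and-reweight routine once sensitivity bounds are available. The following sketch (our own illustration; the function name and seeding are assumptions) implements the sampling and reweighting step of Theorem 11 in the appendix, given precomputed sensitivity upper bounds s(p):

```python
import numpy as np

def sensitivity_sample(P, w, s, m, seed=0):
    """Sample an m-point coreset via sensitivity sampling (Braverman et al., 2016).

    P: (n, d) points; w: (n,) input weights; s: (n,) sensitivity upper bounds.
    Returns the indices of the sampled points and their new weights v."""
    t = s.sum()                                  # total sensitivity
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(P), size=m, p=s / t)    # i.i.d. sample, Pr[p] = s(p)/t
    v = t * w[idx] / (s[idx] * m)                # reweight so the estimator is unbiased
    return idx, v
```

Note that the reweighting makes the coreset sum an unbiased estimator of the full weighted sum: E[Σ_{q∈S} v(q)f(q, x)] = Σ_{p∈P} w(p)f(p, x) for any fixed query x.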

3. METHOD

In this section, we provide coresets for the Gaussian and Laplacian loss functions; we detail both constructions in Section 3.1.2.

Overview of Algorithm 1. Algorithm 1 receives as input a set P of n points in R^d, a weight function w : P → [0, ∞), a bound R on the radius of the ball containing the query space X, and a sample size m > 0. If the sample size m is sufficiently large, then Algorithm 1 outputs a pair (S, v) that is an ε-coreset for the RBF cost function; see Theorem 6. First, d′ is set to be the VC dimension of the quadruple (P, w, X, ρ(·)); see Definition 10. The heart of our algorithm lies in formalizing the RBF loss function as a variant of a regression problem, specifically, a variant of the ℓ_1-regression problem. The conversion requires a manipulation of the input data, as presented at Line 2. We then compute the ∥·∥_1-SVD of the new input data with respect to the ℓ_1-regression problem, followed by bounding the sensitivities of the points (Lines 3–5). We now have all the needed ingredients to use Theorem 11 in order to obtain an ε-coreset: we sample m points i.i.d. from P based on their sensitivity bounds (Line 8), and assign a new weight to every sampled point (Line 9).

Algorithm 1: CORESET(P, w, R, m)
Input: A set P ⊆ R^d of n points, a weight function w : P → [0, ∞), a bound R on the radius of the query space X, and a sample size m ≥ 1.
Output: A pair (S, v) that satisfies Theorem 6.
1: Set d′ := the VC dimension of the quadruple (P, w, X, ρ(·))  // see Definition 10
2: Set P′ := {q_p = (∥p∥_2², −2p^T, 1)^T | p ∈ P}
3: Set (U, D, V) to be the ∥·∥_1-SVD of (P′, w, |·|)  // see Definition 2
4: for every p ∈ P do
5:   Set s(p) := e^{12R²}(1 + 8R²)(w(p)/Σ_{q∈P} w(q) + w(p)∥U(q_p)∥_1)  // bound on the sensitivity of p, as in Lemma 13 in the appendix
6: Set t := Σ_{p∈P} s(p)
7: Set c ≥ 1 to be a sufficiently large constant  // can be determined from Theorem 6
8: Pick an i.i.d. sample S of m points from P, where each p ∈ P is sampled with probability s(p)/t
9: Set v : S → [0, ∞) to be a weight function such that v(q) = t/(s(q)·m) for every q ∈ S
10: return (S, v)
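As a rough NumPy sketch of the steps of Algorithm 1, the following is our own illustration under a significant simplification: the ∥·∥_1-SVD of Definition 2 is replaced by an ordinary SVD of the weighted lifted matrix, which is only a heuristic stand-in for the paper's construction, not the exact factorization the guarantees require.

```python
import numpy as np

def rbf_coreset(P, w, R, m, seed=0):
    """Sketch of Algorithm 1 (CORESET). NOTE: the ||.||_1-SVD of Definition 2 is
    approximated here by a plain SVD of the weighted lifted points -- a heuristic
    stand-in, not the paper's exact construction."""
    n, d = P.shape
    # Line 2: lift each point p to q_p = (||p||_2^2, -2 p^T, 1)^T
    Q = np.hstack([(P ** 2).sum(axis=1, keepdims=True), -2.0 * P, np.ones((n, 1))])
    # Surrogate for ||U(q_p)||_1: l1 norms of the rows of an orthonormal left factor
    U, _, _ = np.linalg.svd(w[:, None] * Q, full_matrices=False)
    u_norms = np.abs(U).sum(axis=1)
    # Lines 4-5: sensitivity bounds with the constants of Lemma 13
    s = np.exp(12.0 * R ** 2) * (1.0 + 8.0 * R ** 2) * (w / w.sum() + w * u_norms)
    t = s.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, p=s / t)  # Line 8: i.i.d. sample, Pr[p] = s(p)/t
    v = t / (s[idx] * m)                  # Line 9: reweight each sampled point
    return idx, v
```

The expensive part (the SVD and sensitivity vector) is computed once; repeated coresets then only require the cheap sampling step.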

3.1.1. LOWER BOUND ON THE CORESET SIZE FOR THE GAUSSIAN LOSS FUNCTION

We first show a lower bound on the size of coresets, to emphasize the need for assumptions on the data and the query space.

Theorem 4. There exists a set of n points P ⊆ R^d such that Σ_{p∈P} s(p) = Ω(n).

Proof. Let d ≥ 3 and let P ⊆ R^d be a set of n points distributed evenly on a 2-dimensional sphere of radius √(ln n) / (2 sin(π/n)), so that, by the law of cosines, every p ∈ P satisfies √(ln n) = min_{q∈P\{p}} ∥p − q∥_2; see Figure C. Observe that for every p ∈ P,

s(p) := max_{x∈R^d} e^{−∥p−x∥_2²} / Σ_{q∈P} e^{−∥q−x∥_2²} ≥ e^{−∥p−p∥_2²} / Σ_{q∈P} e^{−∥p−q∥_2²} = 1 / (1 + Σ_{q∈P\{p}} e^{−∥p−q∥_2²}) ≥ 1 / (1 + Σ_{q∈P\{p}} 1/n) ≥ 1/2,  (1)

where the first equality holds by definition of the sensitivity, the first inequality and second equality hold trivially, the second inequality follows from the assumption that √(ln n) ≤ min_{q∈P\{p}} ∥p − q∥_2, and finally the last inequality holds since Σ_{q∈P\{p}} 1/n ≤ 1.
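The chain of inequalities in (1) can be checked numerically. The sketch below (our own illustration) builds n evenly spaced points on a circle in R^3, scaled via the chord-length formula 2r sin(π/n) so that the minimum pairwise distance is √(ln n), and verifies that every per-point sensitivity lower bound exceeds 1/2:

```python
import numpy as np

n = 64
# n points evenly spaced on a circle in R^3, scaled so the minimum pairwise
# distance equals sqrt(ln n) (the construction of Theorem 4); chord = 2 r sin(pi/n)
theta = 2.0 * np.pi * np.arange(n) / n
r = np.sqrt(np.log(n)) / (2.0 * np.sin(np.pi / n))
P = np.stack([r * np.cos(theta), r * np.sin(theta), np.zeros(n)], axis=1)

# Lower-bound each sensitivity by evaluating the query x = p (first step of eq. (1))
D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # squared pairwise distances
s_lb = 1.0 / np.exp(-D2).sum(axis=1)                 # 1 / (1 + sum_{q != p} e^{-||p-q||^2})
assert (s_lb >= 0.5).all()                           # hence total sensitivity is Omega(n)
```

Since each off-diagonal term e^{−∥p−q∥²} is at most e^{−ln n} = 1/n, the denominator is at most 2, exactly as the proof argues.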

3.1.2. REASONABLE ASSUMPTIONS LEAD TO EXISTENCE OF CORESETS

Unfortunately, it is not immediately straightforward to bound the sensitivities of either the Gaussian loss function or the Laplacian loss function. Therefore, we first require the following structural properties in order to relate the Gaussian and Laplacian loss functions to more manageable quantities. We shall ultimately relate the function e^{−|p^T x|} to both the Gaussian and Laplacian loss functions; thus, we first relate the function e^{−|p^T x|} to the function |p^T x| + 1.

Claim 5. Let p ∈ R^d such that ∥p∥_2 ≤ 1, and let R > 0 be a positive real number. Then for every x ∈ {x ∈ R^d | ∥x∥_2 ≤ R},

(1 + |p^T x|) / (e^R (1 + R)) ≤ e^{−|p^T x|} ≤ |p^T x| + 1.

In what follows, we provide the analysis of our coreset construction for the RBF and Laplacian loss functions, considering an input set of points lying in the unit ball. We refer the reader to the supplementary material for a generalization of our approach toward general input sets of points.

Theorem 6 (Coreset for RBF). Let R ≥ 1 be a positive real number, X = {x ∈ R^d | ∥x∥_2 ≤ R}, and let ε, δ ∈ (0, 1). Let (P, w, X, f) be a query space as in Definition 9 such that every p ∈ P satisfies ∥p∥_2 ≤ 1. For every x ∈ X and p ∈ P, let f(p, x) := ρ(∥p − x∥_2). Let (S, v) be the output of a call to CORESET(P, w, R, m), where S ⊆ P and v : S → [0, ∞). Then (S, v) is an ε-coreset of (P, w) with probability at least 1 − δ, if m = O( (e^{12R²} R² d^{1.5} / ε²) (R² + log d + log(1/δ)) ).
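Claim 5 can be sanity-checked numerically with random p in the unit ball and random x in the radius-R ball; the snippet below is our own illustrative check, not part of the paper's proofs:

```python
import numpy as np

rng = np.random.default_rng(1)
R = 2.0
for _ in range(1000):
    p = rng.normal(size=3)
    p /= max(1.0, np.linalg.norm(p))                      # ||p||_2 <= 1
    x = rng.normal(size=3)
    x *= (R / np.linalg.norm(x)) * rng.random()           # ||x||_2 <= R
    t = abs(p @ x)                                        # |p^T x| <= R
    lower = (1.0 + t) / (np.e ** R * (1.0 + R))
    assert lower <= np.exp(-t) <= t + 1.0                 # Claim 5
```

The lower bound holds since (1 + t) ≤ (1 + R) implies lower ≤ e^{−R} ≤ e^{−t}, and the upper bound holds since e^{−t} ≤ 1 ≤ t + 1.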

Coreset for Laplacian loss function.

In what follows, we provide a coreset for the Laplacian loss function. Intuitively speaking, by leveraging the properties of the Laplacian loss function, we were able to construct a coreset that holds for every vector x ∈ R^d, unlike the RBF case, where the coreset holds only for a ball of radius R. We stress that the reason for this is that the Laplacian loss function is less sensitive than the RBF.

Theorem 7 (Coreset for the Laplacian loss function). Let (P, w, R^d, f) be a query space as in Definition 9 such that every p ∈ P satisfies ∥p∥_2 ≤ 1. For x ∈ R^d and p ∈ P, let f(p, x) := e^{−∥p−x∥_2}. Let ε, δ ∈ (0, 1). Then there exists an algorithm that, given P, w, ε, δ, returns a weighted set (S, v), where S ⊆ P is of size O( (√n d^{1.25} / ε²) (log n + log d + log(1/δ)) ) and v : S → [0, ∞) is a weight function, such that (S, v) is an ε-coreset of (P, w) with probability at least 1 − δ.

4. RADIAL BASIS FUNCTION NETWORKS

In this section, we consider coresets for RBFNNs. Consider an RBFNN with L neurons in the hidden layer and a single output neuron. First note that the hidden layer uses radial basis functions as activation functions, so that the output is a scalar function of the input layer, ϕ : R^d → R, defined by ϕ(x) = Σ_{i=1}^L α_i ρ(∥x − c^(i)∥_2), where c^(i) ∈ R^d for each i ∈ [L]. For an input dataset P and a corresponding desired output function y : P → R, RBFNNs aim to optimize

Σ_{p∈P} ( y(p) − Σ_{i=1}^L α_i e^{−∥p−c^(i)∥_2²} )².

Expanding the above cost function, we obtain that RBFNNs aim to optimize

Σ_{p∈P} y(p)² − 2 Σ_{i=1}^L α_i ( Σ_{p∈P} y(p) e^{−∥p−c^(i)∥_2²} ) + Σ_{p∈P} ( Σ_{i=1}^L α_i e^{−∥p−c^(i)∥_2²} )²,  (2)

where we refer to the middle term as the α term and to the last term as the β term.

Bounding the α term in equation 2. We first define, for every x ∈ R^d,

ϕ⁺(x) = Σ_{p∈P: y(p)>0} y(p) e^{−∥p−x∥_2²} and ϕ⁻(x) = Σ_{p∈P: y(p)<0} |y(p)| e^{−∥p−x∥_2²}.

Observe that Σ_{p∈P} y(p) ρ(∥p − c^(i)∥_2) = ϕ⁺(c^(i)) − ϕ⁻(c^(i)). Thus the α term in equation 2 can be approximated using the following.

Theorem 8. There exists an algorithm that samples O( (e^{8R²} R² d^{1.5} / ε²) (R² + log d + log(2/δ)) ) points to form weighted sets (S_1, w_1) and (S_2, w_2) such that, with probability at least 1 − 2δ,

| Σ_{p∈P} y(p)ϕ(p) − ( Σ_{i∈[L]: α_i>0} α_i Σ_{p∈S_1} w_1(p) e^{−∥p−c^(i)∥_2²} + Σ_{j∈[L]: α_j<0} α_j Σ_{q∈S_2} w_2(q) e^{−∥q−c^(j)∥_2²} ) | / ( Σ_{i∈[L]: α_i>0} α_i (ϕ⁺(c^(i)) + ϕ⁻(c^(i))) + Σ_{i∈[L]: α_i<0} |α_i| (ϕ⁺(c^(i)) + ϕ⁻(c^(i))) ) ≤ ε.

Bounding the β term in equation 2. By Cauchy's inequality, it holds that

Σ_{p∈P} ( Σ_{i=1}^L α_i e^{−∥p−c^(i)∥_2²} )² ≤ L Σ_{p∈P} Σ_{i=1}^L α_i² e^{−2∥p−c^(i)∥_2²} = L Σ_{i=1}^L α_i² Σ_{p∈P} e^{−2∥p−c^(i)∥_2²},

where the equality holds by simple rearrangement. Using Theorem 6, we can approximate the upper bound on β within a factor of L(1 + ε). However, if α_i ≥ 0 for every i ∈ [L], then we also have the lower bound

Σ_{i=1}^L α_i² Σ_{p∈P} e^{−2∥p−c^(i)∥_2²} ≤ Σ_{p∈P} ( Σ_{i=1}^L α_i e^{−∥p−c^(i)∥_2²} )².

Since we can generate a multiplicative coreset for the left-hand side of the above inequality, we obtain a multiplicative coreset, in a sense, for β as well.
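The algebraic expansion in equation 2 can be verified numerically. The following sketch (our own illustration, with arbitrary random data) checks that the direct squared loss equals the y² term minus twice the α term plus the β term:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))      # input points p
y = rng.normal(size=20)           # target labels y(p)
C = rng.normal(size=(4, 3))       # centers c^(i), L = 4 hidden neurons
alpha = rng.normal(size=4)        # output-layer weights alpha_i

G = np.exp(-((P[:, None, :] - C[None, :, :]) ** 2).sum(-1))  # G[p, i] = e^{-||p - c^(i)||^2}
phi = G @ alpha                                              # RBFNN output per point

direct = ((y - phi) ** 2).sum()                              # squared loss
# Expansion of equation (2): y^2 term, the "alpha" cross term, and the "beta" term
expanded = (y ** 2).sum() - 2.0 * (alpha * (y @ G)).sum() + (phi ** 2).sum()
assert np.isclose(direct, expanded)
```

Here `(alpha * (y @ G)).sum()` is exactly Σ_i α_i Σ_p y(p)e^{−∥p−c^(i)∥²}, the α term that Theorem 8 approximates, and `(phi ** 2).sum()` is the β term.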

5. BENEFIT OF OUR CORESET OVER OTHER SUBSET SELECTION METHODS

One coreset for all networks. Our coreset is model independent, i.e., we aim at improving the running time of multiple neural networks. Contrary to other methods, which need to compute the coreset after each gradient update to support their theoretical proofs, our method gives the advantage of computing the sensitivities (and thus the coreset) only once, for all of the required networks. This is because our coreset can approximate any function that can be defined (approximated) using an RBFNN model.

Efficient coreset per epoch. In practice, the competing methods for data selection are not applied before each epoch, but every R epochs; this is because the competing methods require a lot of time to compute a new coreset, since they compute the gradients of the network with respect to each input training point. In contrast, our coreset can be computed before each epoch in a negligible time (∼0 seconds): we compute the sensitivity of each point (image) in the data once at the beginning, and then, whenever we need to create a new coreset, we simply sample from the input data according to the sensitivity distribution.
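The "one sensitivity vector, many coresets" workflow described above can be sketched as follows; the class name is our own illustration, and the sensitivity vector is assumed to come from a one-time pass such as Algorithm 1:

```python
import numpy as np

class SensitivitySampler:
    """Compute the sensitivity distribution once; draw a fresh coreset each epoch.

    After the one-time setup, resampling a coreset is a single np.random call,
    independent of the network being trained."""

    def __init__(self, sensitivities, seed=0):
        self.s = np.asarray(sensitivities, dtype=float)  # one-time sensitivity vector
        self.t = self.s.sum()
        self.rng = np.random.default_rng(seed)

    def sample_epoch(self, m):
        """Draw an m-point coreset (indices and weights) for the next epoch."""
        idx = self.rng.choice(len(self.s), size=m, p=self.s / self.t)
        v = self.t / (self.s[idx] * m)  # coreset weights, as in Algorithm 1
        return idx, v
```

A fresh call to `sample_epoch` before every epoch costs only the time of one categorical draw, which is why resampling every epoch is essentially free compared to gradient-based selection.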

6. EXPERIMENTAL RESULTS

In this section, we practically demonstrate the efficiency and stability of our RBFNN coreset approach for training deep neural networks via data subset selection. We mainly study the trade-off between accuracy and efficiency (time/subset size).

Competing methods. We compare our method against several variants of the algorithms proposed in Killamsetty et al. (2021a), including their per-batch (PB) and warm-start (WARM) variants, and combinations of both. In other words, the competing methods are GRAD-MATCH, GRAD-MATCHPB, GRAD-MATCH-WARM, GRAD-MATCHPB-WARM, CRAIG, CRAIGPB, CRAIG-WARM, CRAIGPB-WARM, and GLISTER-WARM. We also compare against randomly selecting points (denoted by RANDOM).

Datasets and model architecture. We performed our experiments for training CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) on ResNet18 (He et al., 2016), and MNIST (LeCun et al., 1998) on LeNet.

The setting. We adopted the same setting as Killamsetty et al. (2021a): we used an SGD optimizer for training with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 5e-4, and we decay the learning rate using cosine annealing (Loshchilov & Hutter, 2016).

Subset sizes and the R parameter. For MNIST, we use sizes of {1%, 3%, 5%, 10%}, while for CIFAR10 and CIFAR100, we use {5%, 10%, 20%, 30%} on ResNet18. Since the competing methods require a lot of time to compute the gradients, we set R = 20. We note that our coreset could be tested with R = 1 without adding run-time, since once the sensitivity vector is defined, computing a new coreset requires ∼0 seconds; however, we test it with R = 20, to show its robustness.

Discussion. Tables 1 and 2 report the results for CIFAR10 and CIFAR100 on ResNet18, respectively.
It is clear from Table 1 that our method achieves the best accuracy (without warm start) for 20% and 30% subset selection on CIFAR10, while for CIFAR100 and smaller subset selections on CIFAR10, our method drastically outperforms all of the methods that do not apply warm start (training on the whole data), and we still outperform all of the other methods in terms of accuracy vs. time. The same phenomenon is witnessed in the MNIST experiment (Table 4), where our coreset results are consistently placed among the top methods across all sizes. Furthermore, we note that our sensitivity sampling vector is computed once during our experiments for each dataset; this vector can be used to sample coresets of different sizes, for different networks, at different epochs of training, in a time that is close to zero seconds. Note that the best results are represented in bold text, while the fastest are underlined.

Function approximation. Here, we compare our coreset to uniform sampling for approximating the linear regression loss function by running an RBFNN on the coreset, where the results are averaged across 8 trials and the shaded regions correspond to the median absolute deviation. It is clear from Figure 2 that our coreset outperforms uniform sampling; furthermore, learning on the coreset was ×30 faster than learning on the whole data. The dataset is the 3D spatial networks dataset (Dua et al., 2017).

7. CONCLUSION AND FUTURE WORK

In this paper, we have introduced a coreset that provably approximates any function that can be represented by an RBFNN architecture. Our coreset construction can be used to approximate the gradients of any deep neural network (DNN), leading towards provable subset selection for learning/training DNNs. We also empirically demonstrate the value of our work by showing significantly better performance over various datasets and model architectures. As the first work on using coresets for data subset selection with respect to RBFNNs, our results lead to a number of interesting possible future directions.
It is natural to ask whether there exist smaller coreset constructions that also provably give the same worst-case approximation guarantees. Another question is whether our results can be extended to more general classes of loss functions. Finally, we remark that although our empirical results significantly beat the state of the art, they nevertheless only serve as a proof of concept and have not been fully optimized with additional heuristics.

A SENSITIVITY SAMPLING FRAMEWORK

Definition 10 (VC-dimension (Braverman et al., 2016)). For a query space (P, w, R^d, f) and r ∈ [0, ∞), we define ranges(x, r) = {p ∈ P | w(p)f(p, x) ≤ r} for every x ∈ R^d and r ≥ 0. The dimension of (P, w, R^d, f) is the size |S| of the largest subset S ⊆ P such that |{S ∩ ranges(x, r) | x ∈ R^d, r ≥ 0}| = 2^{|S|}, where |A| denotes the number of points in A for every A ⊆ R^d.

The following theorem formally describes how to construct an ε-coreset based on the sensitivity sampling framework.

Theorem 11 (Restatement of Theorem 5.5 in (Braverman et al., 2016)). Let (P, w, R^d, f) be a query space as in Definition 9. For every p ∈ P, define the sensitivity of p as sup_{x∈R^d} w(p)f(p, x) / Σ_{q∈P} w(q)f(q, x), where the sup is over every x ∈ R^d such that the denominator is nonzero. Let s : P → [0, 1] be a function such that s(p) is an upper bound on the sensitivity of p. Let t = Σ_{p∈P} s(p) and let d′ be the VC dimension of the query space (P, w, R^d, f); see Definition 10. Let c ≥ 1 be a sufficiently large constant, let ε, δ ∈ (0, 1), and let S be a random sample of |S| ≥ (ct/ε²)(d′ log t + log(1/δ)) i.i.d. points from P, such that every p ∈ P is sampled with probability s(p)/t. Let v(p) = t·w(p) / (s(p)|S|) for every p ∈ S. Then, with probability at least 1 − δ, (S, v) is an ε-coreset for P with respect to f.

B PROOFS FOR OUR MAIN THEOREMS

B.1 PROOF OF CLAIM 5

Claim 5. Let p ∈ R^d such that ∥p∥_2 ≤ 1, and let R > 0 be a positive real number. Then for every x ∈ {x ∈ R^d | ∥x∥_2 ≤ R},

(1 + |p^T x|) / (e^R (1 + R)) ≤ e^{−|p^T x|} ≤ |p^T x| + 1.

Proof. Put x ∈ {x ∈ R^d | ∥x∥_2 ≤ R}, and note that if p^T x = 0 then the claim is trivial. Otherwise, the upper bound holds since e^{−|p^T x|} ≤ 1 ≤ |p^T x| + 1, and for the lower bound we observe that

e^{−|p^T x|} ≥ e^{−R} ≥ (1 + |p^T x|) / ((1 + R) e^R),

where the first inequality holds since |p^T x| ≤ ∥p∥_2 ∥x∥_2 ≤ R, and the second holds since 1 + |p^T x| ≤ 1 + R.

B.2 PROOF OF THEOREM 6

Claim 12. Let a, b ≥ 0 be a pair of nonnegative real numbers and let c, d > 0 be a pair of positive real numbers. Then (a + b)/(c + d) ≤ a/c + b/d.

Proof. Observe that (a + b)/(c + d) = a/(c + d) + b/(c + d) ≤ a/c + b/d, where the inequality holds since c, d > 0.

Lemma 13 (Sensitivity bound w.r.t. the RBF loss function). Let R ≥ 1 be a positive real number, and let X = {x ∈ R^d | ∥x∥_2 ≤ R}. Let (P, w, R^d, f) be a query space as in Definition 9, where for every p ∈ P and x ∈ R^d, f(p, x) = e^{−∥p−x∥_2²}. Let P′ := {q_p = (∥p∥_2², −2p^T, 1)^T | p ∈ P}, and let q* ∈ arg sup_{q∈P′} e^{3R²∥q∥_2}(1 + 3R²∥q∥_2). Let u(p) := w(p) / (e^{3R²∥q_p∥_2}(1 + 3R²∥q_p∥_2)), and let (U, D, V) be the ∥·∥_1-SVD of (P′, u(·)). Then for every p ∈ P,

s(p) ≤ e^{3R²∥q_p∥_2}(1 + 3R²∥q_p∥_2)( u(p)/Σ_{p′∈P} u(p′) + u(p)∥U(q_p)∥_1 ),

and Σ_{q∈P} s(q) ≤ e^{3R²∥q*∥_2}(1 + 3R²∥q*∥_2)(1 + (d + 2)^{1.5}).

Proof. Observe that for every p ∈ P and x ∈ R^d, it holds that ∥p − x∥_2² = q_p^T y, where q_p = (∥p∥_2², −2p^T, 1)^T and y = (1, x^T, ∥x∥_2²)^T. Let Y = {(1, x^T, ∥x∥_2²)^T | x ∈ X}. Following the definition of Y, for every y ∈ Y we obtain that ∥y∥_2 ≤ 3R². Hence, by plugging p := q_p/∥q_p∥_2 and R := 3R²∥q_p∥_2 for every p ∈ P into Claim 5, we obtain that for every y ∈ Y and p ∈ P,

(1 + |q_p^T y|) / (e^{3R²∥q_p∥_2}(1 + 3R²∥q_p∥_2)) ≤ e^{−|q_p^T y|} ≤ 1 + |q_p^T y|.

Note that for every p ∈ P, u(p) := w(p) / (e^{3R²∥q_p∥_2}(1 + 3R²∥q_p∥_2)). Thus

sup_{x∈X} w(p)f(p, x) / Σ_{q∈P} w(q)f(q, x) = sup_{y∈Y} w(p)e^{−|q_p^T y|} / Σ_{p′∈P} w(p′)e^{−|q_{p′}^T y|} ≤ e^{3R²∥q_p∥_2}(1 + 3R²∥q_p∥_2) · sup_{y∈Y} ( u(p)|q_p^T y| + u(p) ) / ( Σ_{p′∈P} u(p′)|q_{p′}^T y| + Σ_{p′∈P} u(p′) ) ≤ e^{3R²∥q_p∥_2}(1 + 3R²∥q_p∥_2)( u(p)/Σ_{p′∈P} u(p′) + sup_{y∈Y} u(p)|q_p^T y| / Σ_{p′∈P} u(p′)|q_{p′}^T y| ),

where the first inequality holds by Claim 5 and the second inequality is by Claim 12. Let f̃ : P′ × Y → [0, ∞) be a function such that for every q ∈ P′ and y ∈ Y, f̃(q, y) = |q^T y|. Plugging P := P′, w := u, d := d + 2, and f := f̃ into Lemma 3 yields, for every p ∈ P,

sup_{x∈X} w(p)f(p, x) / Σ_{q∈P} w(q)f(q, x) ≤ e^{3R²∥q_p∥_2}(1 + 3R²∥q_p∥_2)( u(p)/Σ_{p′∈P} u(p′) + u(p)∥U(q_p)∥_1 ).

Note that by definition, q* ∈ arg sup_{q∈P′} e^{3R²∥q∥_2}(1 + 3R²∥q∥_2). Then the total sensitivity is bounded by

Σ_{p∈P} sup_{x∈X} w(p)f(p, x) / Σ_{q∈P} w(q)f(q, x) ≤ e^{3R²∥q*∥_2}(1 + 3R²∥q*∥_2)(1 + (d + 2)^{1.5}).

Theorem 6 (Coreset for RBF). Let R ≥ 1 be a positive real number, X = {x ∈ R^d | ∥x∥_2 ≤ R}, and let ε, δ ∈ (0, 1). Let (P, w, X, f) be a query space as in Definition 9 such that every p ∈ P satisfies ∥p∥_2 ≤ 1. For every x ∈ X and p ∈ P, let f(p, x) := ρ(∥p − x∥_2). Let (S, v) be the output of a call to CORESET(P, w, R, m), where S ⊆ P and v : S → [0, ∞). Then (S, v) is an ε-coreset of (P, w) with probability at least 1 − δ, if m = O( (e^{12R²} R² d^{1.5} / ε²) (R² + log d + log(1/δ)) ).

Proof. First, by plugging the query space (P, w, R^d, f) into Lemma 13, we obtain a bound on the sensitivity s(p) for every p ∈ P and a bound on the total sensitivity t := e^{12R²}(1 + 12R²)(1 + (d + 2)^{1.5}), since max_{q∈P} ∥q∥_2 ≤ 1. Notice that the analysis done in Lemma 13 is analogous to the steps of Algorithm 1. By plugging the bounds on the sensitivities, the bound on the total sensitivity t, the probability of failure δ ∈ (0, 1), and the approximation error ε ∈ (0, 1) into Theorem 11, we obtain a subset S′ ⊆ P and v′ : S′ → [0, ∞) such that the tuple (S′, v′) is an ε-coreset for (P, w) with probability at least 1 − δ.

B.3 PROOF OF THEOREM 7

Lemma 14 (Sensitivity bound w.r.t. the Laplacian loss function). Let $(P, w, \mathbb{R}^d, f)$ be a query space as in Definition 9, where for every $p \in P$ and $x \in \mathbb{R}^d$, $f(p,x) = e^{-\|p-x\|_2}$. Let $P' := \{q_p = (\|p\|_2^2, -2p^T, 1)^T \mid p \in P\}$ and let $q^* \in \arg\sup_{q \in P'} e^{3\sqrt{\|q\|_2}}(1 + 3\sqrt{\|q\|_2})$. Let
\[ u(p) := \frac{w(p)}{e^{3\sqrt{\|q_p\|_2}}\left(1 + 3\sqrt{\|q_p\|_2}\right)}, \]
and let $(U, D, V)$ be the $\|\cdot\|_1$-SVD of $(P', u^2(\cdot))$. Then for every $p \in P$,
\[ s(p) \le e^{3\sqrt{\|q_p\|_2}}\left(1 + 3\sqrt{\|q_p\|_2}\right)\left(\frac{u(p)}{\sum_{p' \in P} u(p')} + u(p)\,\|U(q_p)\|_1\right) + \frac{e^{\|p\|_2 + \sqrt{\|q^*\|_2}}\, w(p)}{\sum_{q \in P} w(q)}, \]
and
\[ \sum_{q \in P} s(q) \le 2e^{3\sqrt{\|q^*\|_2}} + e^{3\sqrt{\|q^*\|_2}}\left(1 + 3\sqrt{\|q^*\|_2}\right)\left(1 + \sqrt{n}\,(d+2)^{1.25}\right). \]

Proof. Let $X = \{x \in \mathbb{R}^d \mid \|x\|_2 \le 1\}$, and observe that for every $p \in P$ and $x \in \mathbb{R}^d$, it holds that $\|p - x\|_2 = \sqrt{q_p^T y}$, where $q_p = (\|p\|_2^2, -2p^T, 1)^T$ and $y = (1, x^T, \|x\|_2^2)^T$. Let $Y = \{(1, x^T, \|x\|_2^2)^T \mid x \in X\}$. Hence, following Theorem 11, the sensitivity of each point $p \in P$ can be bounded as
\[ \sup_{x \in \mathbb{R}^d} \frac{w(p) f(p,x)}{\sum_{q \in P} w(q) f(q,x)} \le \sup_{x \in X} \frac{w(p) f(p,x)}{\sum_{q \in P} w(q) f(q,x)} + \sup_{x \in \mathbb{R}^d \setminus X} \frac{w(p) f(p,x)}{\sum_{q \in P} w(q) f(q,x)}. \tag{9} \]
From here, we bound the sensitivity separately over each of the two subsets of $\mathbb{R}^d$.

Handling queries from $X$. Following the definition of $Y$, for every $y \in Y$ we obtain that $\|y\|_2 \le 3$. Hence, by plugging $p := q_p$ and $R := 3\sqrt{\|q_p\|_2}$ for every $p \in P$ into Claim 5, we obtain that for every $y \in Y$ and $p \in P$,
\[ \frac{1 + \sqrt{q_p^T y}}{e^{3\sqrt{\|q_p\|_2}}\left(1 + 3\sqrt{\|q_p\|_2}\right)} \le e^{-\sqrt{q_p^T y}} \le 1 + \sqrt{q_p^T y}. \tag{10} \]
Recall that for every $p \in P$, $u(p) := w(p)/\big(e^{3\sqrt{\|q_p\|_2}}(1 + 3\sqrt{\|q_p\|_2})\big)$. Combining equation 9 and equation 10 yields
\[
\sup_{x \in X} \frac{w(p) f(p,x)}{\sum_{q \in P} w(q) f(q,x)}
\le e^{3\sqrt{\|q_p\|_2}}\left(1 + 3\sqrt{\|q_p\|_2}\right) \sup_{y \in Y} \frac{u(p) + u(p)\sqrt{q_p^T y}}{\sum_{p' \in P} u(p') + \sum_{p' \in P} u(p')\sqrt{q_{p'}^T y}}
\le e^{3\sqrt{\|q_p\|_2}}\left(1 + 3\sqrt{\|q_p\|_2}\right)\left(\frac{u(p)}{\sum_{p' \in P} u(p')} + \sup_{y \in Y} \frac{u(p)\sqrt{q_p^T y}}{\sum_{p' \in P} u(p')\sqrt{q_{p'}^T y}}\right),
\]
where the first inequality holds by Claim 5 and the second inequality is by Claim 12.
By the Cauchy-Schwarz inequality,
\[
\sup_{y \in Y} \frac{u(p)\sqrt{q_p^T y}}{\sum_{p' \in P} u(p')\sqrt{q_{p'}^T y}}
= \sup_{y \in Y} \frac{\sqrt{u(p)^2\, q_p^T y}}{\sum_{p' \in P} \sqrt{u(p')^2\, q_{p'}^T y}}
\le \sup_{y \in Y} \sqrt{\frac{u(p)^2\, |q_p^T y|}{\sum_{p' \in P} u(p')^2\, |q_{p'}^T y|}}
\le \sup_{y \in \mathbb{R}^{d+2}} \sqrt{\frac{u(p)^2\, |q_p^T y|}{\sum_{p' \in P} u(p')^2\, |q_{p'}^T y|}},
\]
where the last inequality follows from taking the supremum over a larger set. Let $u' : P' \to [0, \infty)$ be the weight function such that for every $p \in P$, $u'(q_p) = u(p)^2$, and let $\tilde{f} : P' \times Y \to [0, \infty)$ be the function such that for every $q \in P'$ and $y \in Y$, $\tilde{f}(q, y) = |q^T y|$. Plugging $P := P'$, $w := u'$, $d := d+2$, and $f := \tilde{f}$ into Lemma 3 yields that for every $p \in P$,
\[ \sup_{x \in X} \frac{w(p) f(p,x)}{\sum_{q \in P} w(q) f(q,x)} \le e^{3\sqrt{\|q_p\|_2}}\left(1 + 3\sqrt{\|q_p\|_2}\right)\left(\frac{u(p)}{\sum_{p' \in P} u(p')} + u(p)\,\|U(q_p)\|_1\right). \tag{13} \]
Note that by definition, $q^* \in \arg\sup_{q \in P'} e^{3\sqrt{\|q\|_2}}(1 + 3\sqrt{\|q\|_2})$. Hence the contribution of the queries in $X$ to the total sensitivity is bounded by
\[ \sum_{p \in P} \sup_{x \in X} \frac{w(p) f(p,x)}{\sum_{q \in P} w(q) f(q,x)} \le e^{3\sqrt{\|q^*\|_2}}\left(1 + 3\sqrt{\|q^*\|_2}\right)\left(1 + \sqrt{n}\,(d+2)^{1.25}\right), \tag{14} \]
where the $\sqrt{n}$ factor follows from $\sum_{p' \in P} \sqrt{\|U(q_{p'})\|_1} \le \sqrt{n}\sqrt{\sum_{p' \in P} \|U(q_{p'})\|_1}$, which is used when applying Lemma 3; this inequality is itself a consequence of the Cauchy-Schwarz inequality.

Handling queries from $\mathbb{R}^d \setminus X$. First, we observe that for any integer $m \ge 1$ and $x, y \in \mathbb{R}^m$,
\[ -\|x\|_2 - \|y\|_2 \le -\|x - y\|_2 \le \|x\|_2 - \|y\|_2, \tag{15} \]
where the first inequality holds by the triangle inequality, and the second follows from the reverse triangle inequality. Thus, by letting $x_p \in \arg\sup_{x \in \mathbb{R}^d \setminus X} \frac{w(p) f(p,x)}{\sum_{q \in P} w(q) f(q,x)}$ for every $p \in P$, we obtain that
\[ \frac{w(p) f(p, x_p)}{\sum_{q \in P} w(q) f(q, x_p)} \le \frac{w(p)\, e^{\|p\|_2 - \|x_p\|_2}}{\sum_{q \in P} w(q)\, e^{-\|q\|_2 - \|x_p\|_2}} \le \frac{w(p)\, e^{\|p\|_2 - \|x_p\|_2}}{\sum_{q \in P} w(q)\, e^{-\sqrt{\|q^*\|_2} - \|x_p\|_2}} = \frac{e^{\|p\|_2 + \sqrt{\|q^*\|_2}}\, w(p)}{\sum_{q \in P} w(q)}, \tag{16} \]
where the first inequality holds by equation 15, and the second inequality holds since $\sqrt{\|q^*\|_2} \ge \|q\|_2$ for every $q \in P$.
Combining equations 9, 13, 14, and 16 yields that for every $p \in P$,
\[ s(p) \le e^{3\sqrt{\|q_p\|_2}}\left(1 + 3\sqrt{\|q_p\|_2}\right)\left(\frac{u(p)}{\sum_{p' \in P} u(p')} + u(p)\,\|U(q_p)\|_1\right) + \frac{e^{\|p\|_2 + \sqrt{\|q^*\|_2}}\, w(p)}{\sum_{q \in P} w(q)} \]
and
\[ \sum_{p \in P} s(p) \le 2e^{3\sqrt{\|q^*\|_2}} + e^{3\sqrt{\|q^*\|_2}}\left(1 + 3\sqrt{\|q^*\|_2}\right)\left(1 + \sqrt{n}\,(d+2)^{1.25}\right). \]

Theorem 7 (Coreset for the Laplacian loss function). Let $(P, w, \mathbb{R}^d, f)$ be a query space as in Definition 9 such that every $p \in P$ satisfies $\|p\|_2 \le 1$, where for every $x \in \mathbb{R}^d$ and $p \in P$, $f(p,x) := e^{-\|p-x\|_2}$. Let $\varepsilon, \delta \in (0,1)$. Then there exists an algorithm that, given $P, w, \varepsilon, \delta$, returns a weighted set $(S, v)$, where $S \subseteq P$ is of size
\[ O\!\left(\frac{\sqrt{n}\, d^{1.25}}{\varepsilon^2}\left(\log n + \log d + \log\frac{1}{\delta}\right)\right) \]
and $v : S \to [0, \infty)$ is a weight function, such that $(S, v)$ is an $\varepsilon$-coreset of $(P, w)$ with probability at least $1 - \delta$.

Proof. Plugging the query space $(P, w, \mathbb{R}^d, f)$ into Lemma 14 yields a bound on the sensitivity $s(p)$ of every $p \in P$ and a bound on the total sensitivity $t := 2e^5 + e^5(1 + 5)\left(1 + \sqrt{n}\,(d+2)^{1.25}\right)$, since $\max_{p \in P} \|p\|_2 \le 1$ implies $3\sqrt{\|q^*\|_2} \le 5$. By plugging the bounds on the sensitivities, the bound on the total sensitivity $t$, the probability of failure $\delta \in (0,1)$, and the approximation error $\varepsilon \in (0,1)$ into Theorem 11, we obtain a subset $S' \subseteq P$ and $v' : S' \to [0, \infty)$ such that the tuple $(S', v')$ is an $\varepsilon$-coreset for $(P, w)$ with probability at least $1 - \delta$.
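The two-sided bound of equation 15, which drives the handling of queries outside $X$, follows from the triangle and reverse triangle inequalities alone. A minimal numerical sanity check (illustrative Python, not part of the construction):

```python
import numpy as np

# Check equation 15 on random vectors of random dimension:
#   -||x|| - ||z|| <= -||x - z|| <= ||x|| - ||z||
rng = np.random.default_rng(0)
for _ in range(1000):
    m = int(rng.integers(1, 10))
    x, z = rng.standard_normal(m), rng.standard_normal(m)
    nx, nz, nxz = np.linalg.norm(x), np.linalg.norm(z), np.linalg.norm(x - z)
    assert -nx - nz <= -nxz + 1e-9          # triangle inequality
    assert -nxz <= nx - nz + 1e-9           # reverse triangle inequality
```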

B.4 PROOF OF THEOREM 8

To prove Theorem 8, we first prove the following theorem.

Theorem 15. There exists an algorithm that samples
\[ O\!\left(\frac{e^{8R^2} R^2 d^{1.5}}{\varepsilon^2}\left(R^2 + \log d + \log\frac{2}{\delta}\right)\right) \]
points to form weighted sets $(S_1, w_1)$ and $(S_2, w_2)$ such that with probability at least $1 - 2\delta$,
\[ \frac{\left|\sum_{p \in P} y(p)\rho(\|p - c^{(i)}\|_2) - \left(\sum_{p \in S_1} w_1(p)\, e^{-\|p - c^{(i)}\|_2^2} - \sum_{q \in S_2} w_2(q)\, e^{-\|q - c^{(i)}\|_2^2}\right)\right|}{\phi^+\!\left(c^{(i)}\right) + \phi^-\!\left(c^{(i)}\right)} \le \varepsilon. \]

Proof. We first construct strong coresets $(S_1, w_1)$ for $\phi^+$ and $(S_2, w_2)$ for $\phi^-$, each of size $O\!\left(\frac{e^{8R^2} R^2 d^{1.5}}{\varepsilon^2}\left(R^2 + \log d + \log\frac{2}{\delta}\right)\right)$. Hence, by the definition of a strong coreset, we have that
\[ \left|\sum_{\substack{p \in P \\ y(p) > 0}} y(p)\rho(\|p - c^{(i)}\|_2) - \sum_{p \in S_1} w_1(p)\rho(\|p - c^{(i)}\|_2)\right| \le \varepsilon \sum_{\substack{p \in P \\ y(p) > 0}} y(p)\rho(\|p - c^{(i)}\|_2) = \varepsilon\,\phi^+\!\left(c^{(i)}\right) \]
and
\[ \left|\sum_{\substack{p \in P \\ y(p) < 0}} |y(p)|\rho(\|p - c^{(i)}\|_2) - \sum_{q \in S_2} w_2(q)\rho(\|q - c^{(i)}\|_2)\right| \le \varepsilon \sum_{\substack{p \in P \\ y(p) < 0}} |y(p)|\rho(\|p - c^{(i)}\|_2) = \varepsilon\,\phi^-\!\left(c^{(i)}\right) \]
for every center $c^{(i)} \in \mathbb{R}^d$. Thus, by the triangle inequality and a slight rearrangement, we have that with probability at least $1 - 2\delta$,
\[ \frac{\left|\sum_{p \in P} y(p)\rho(\|p - c^{(i)}\|_2) - \left(\sum_{p \in S_1} w_1(p)\, e^{-\|p - c^{(i)}\|_2^2} - \sum_{q \in S_2} w_2(q)\, e^{-\|q - c^{(i)}\|_2^2}\right)\right|}{\phi^+\!\left(c^{(i)}\right) + \phi^-\!\left(c^{(i)}\right)} \le \varepsilon. \]

To prove Theorem 8, we split the weights $\alpha_i$ into the sets $\{i \mid \alpha_i > 0\}$ and $\{i \mid \alpha_i < 0\}$ and apply Theorem 15 to each set. We emphasize that this argument is purely for the purposes of analysis, so that the algorithm itself does not need to partition the $\alpha_i$ quantities (and hence does not need to recompute the coreset when the values of the weights $\alpha_i$ change over time). We thus obtain the following guarantee:

Theorem 8. With probability at least $1 - 2\delta$,
\[ \frac{\left|\sum_{p \in P} y(p)\phi(p) - \left(\sum_{\substack{i \in [L] \\ \alpha_i > 0}} \alpha_i \sum_{p \in S_1} w_1(p)\, e^{-\|p - c^{(i)}\|_2^2} + \sum_{\substack{j \in [L] \\ \alpha_j < 0}} \alpha_j \sum_{q \in S_2} w_2(q)\, e^{-\|q - c^{(j)}\|_2^2}\right)\right|}{\sum_{\substack{i \in [L] \\ \alpha_i > 0}} \alpha_i \left(\phi^+\!\left(c^{(i)}\right) + \phi^-\!\left(c^{(i)}\right)\right) + \sum_{\substack{i \in [L] \\ \alpha_i < 0}} |\alpha_i| \left(\phi^+\!\left(c^{(i)}\right) + \phi^-\!\left(c^{(i)}\right)\right)} \le \varepsilon. \]
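The sign-splitting argument behind Theorem 15 can be sketched in a few lines of illustrative Python: split $P$ by the sign of $y$, build one weighted subset per part (here plain uniform sampling stands in for the paper's coreset construction), and estimate the signed sum as the difference of the two parts.

```python
import numpy as np

def uniform_subset(P, w, m, seed=0):
    # Stand-in for a coreset routine: uniform sample, inverse-probability weights.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(P), size=m)
    return idx, w[idx] * len(P) / m

def signed_subset(P, y, m):
    """Build (S1, w1) for the positive part of y and (S2, w2) for the
    negative part, mirroring the split in the proof of Theorem 15."""
    pos, neg = np.where(y > 0)[0], np.where(y < 0)[0]
    S1, w1 = uniform_subset(P[pos], y[pos], m)        # estimates phi^+
    S2, w2 = uniform_subset(P[neg], -y[neg], m)       # estimates phi^-
    return (pos[S1], w1), (neg[S2], w2)

rho = lambda r2: np.exp(-r2)                          # Gaussian RBF on squared distance
rng = np.random.default_rng(3)
P = 0.3 * rng.standard_normal((4000, 2))
y = rng.standard_normal(4000)
c = np.zeros(2)

(S1, w1), (S2, w2) = signed_subset(P, y, m=1500)
r2 = lambda Q: ((Q - c) ** 2).sum(axis=1)
full = (y * rho(r2(P))).sum()                         # sum_p y(p) rho(||p - c||)
est = (w1 * rho(r2(P[S1]))).sum() - (w2 * rho(r2(P[S2]))).sum()
denom = (np.abs(y) * rho(r2(P))).sum()                # phi^+(c) + phi^-(c)
assert abs(full - est) / denom < 0.1
```

As in the analysis, the error is measured relative to $\phi^+(c) + \phi^-(c)$ rather than to the (possibly near-zero) signed sum itself.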

C LOWER BOUND ON THE CORESET SIZE FOR THE GAUSSIAN LOSS FUNCTION -ILLUSTRATION

Here, we illustrate a dataset such that for any approximation error ε, an ε-coreset must contain at least half of the input points to ensure the desired approximation, from a theoretical point of view.
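The construction behind this lower bound (pictured in Figure 3) can be checked numerically: place $n$ points evenly on a circle scaled so that the minimal pairwise distance is $\sqrt{\ln n}$; then for a query placed at any point $p$, every other point contributes at most $e^{-\ln n} = 1/n$ to the Gaussian loss, so each point has sensitivity at least $1/2$ and the total sensitivity grows linearly in $n$. A short illustrative sketch in Python:

```python
import numpy as np

n = 10
# Radius for which adjacent points are exactly sqrt(ln n) apart.
r = np.sqrt(np.log(n)) / (2 * np.sin(np.pi / n))
theta = 2 * np.pi * np.arange(n) / n
P = r * np.column_stack([np.cos(theta), np.sin(theta)])

# Minimal pairwise distance is attained by adjacent points and equals sqrt(ln n).
D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
assert abs(np.sqrt(D2[0, 1]) - np.sqrt(np.log(n))) < 1e-9

# Query each point at its own location: every other point contributes at most
# e^{-ln n} = 1/n to the Gaussian loss, so each sensitivity is at least 1/2.
K = np.exp(-D2)
sens = 1.0 / K.sum(axis=1)   # f(p_i, p_i) / sum_j f(p_j, p_i), with f = e^{-||.||^2}
assert np.all(sens >= 0.5)
assert sens.sum() >= n / 2
```

Since the total sensitivity is at least $n/2$, any sensitivity-based sampling scheme (and, by the lower-bound argument, any coreset) must keep about half the points on this instance.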

D EXPERIMENTAL RESULTS - EXTENDED

D.1 MNIST RESULTS

In what follows, we present our subset selection results on the MNIST dataset. Tables 5-7 show the standard deviation results over five training runs on the CIFAR10, CIFAR100, and MNIST datasets, respectively. In this experiment, we aim to show the efficacy of our coreset against uniform sampling for the task of fitting RBFs. Specifically, we used two datasets: (i) the 3D spatial networks dataset (Dua et al., 2017), and (ii) the HTRU2 pulsar dataset (Dua et al., 2017). In the following figures, we report the approximation factor averaged over 20 trials, where the shaded area corresponds to the median absolute deviation, a more robust statistical measure than the standard deviation.
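The approximation-factor metric reported in the figures below can be computed as in the following schematic (illustrative Python: the dataset loading and the actual coreset construction are omitted, and uniform sampling is shown only as the baseline):

```python
import numpy as np

def approx_error(P, w, S, v, queries):
    """Empirical approximation factor of a weighted subset (S, v) of (P, w)
    for the RBF loss, maximized over a finite set of query centers."""
    errs = []
    for x in queries:
        full = (w * np.exp(-((P - x) ** 2).sum(axis=1))).sum()
        sub = (v * np.exp(-((P[S] - x) ** 2).sum(axis=1))).sum()
        errs.append(abs(full - sub) / full)
    return max(errs)

rng = np.random.default_rng(0)
P = 0.5 * rng.standard_normal((5000, 3))   # placeholder for a real dataset
w = np.ones(len(P))
queries = 0.5 * rng.standard_normal((20, 3))

m = 800
S = rng.choice(len(P), size=m)             # uniform-sampling baseline
v = w[S] * len(P) / m
err = approx_error(P, w, S, v, queries)
assert 0 <= err < 0.25
```

Sweeping `m` over a range of subset sizes and averaging `err` over repeated trials reproduces the shape of the curves in Figures 4-5.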






Figure 1: Our contribution in a nutshell.

Figure 2: Function approximation.

Figure 3: n points evenly distributed on a circle such that the minimal distance between any two points is at least √(ln n); in this example, n = 10.

start, i.e., training on the whole data for 50% of the training time before training on the coreset for the other 50%; such methods are denoted by adding the suffix -WARM. (iii) A more efficient version of each of the competing methods, denoted by adding the suffix PB (more details are given in Killamsetty et al.).

for each epoch. For MNIST, we trained the LeNet model for 200 epochs. For CIFAR10 and CIFAR100, we trained ResNet18 for 300 epochs, all with batches of size 20 for the subset-selection training versions. We train the data-selection methods and full-data training for the same number of epochs; the main difference is the number of samples used for training in a single epoch. All experiments were executed on V100 GPUs. The reported test accuracy is averaged across five runs.

Data Selection Results for CIFAR10 using ResNet-18

Data Selection Results for CIFAR100 using ResNet-18

Data Selection Results for ImageNet2012 using ResNet-18




Data Selection Results for MNIST using LeNet

Data Selection Results for CIFAR10 using ResNet-18: Standard deviation of the Model (for 5 runs)

Data Selection Results for CIFAR100 using ResNet-18: Standard deviation of the Model (for 5 runs)

Data Selection Results for MNIST using LeNet: Standard deviation of the Model (for 5 runs)


Figure 4: The averaged approximation error on dataset (i): the x-axis is the size of the chosen subset, and the y-axis is the averaged approximation error.

Figure 5: The averaged approximation error on dataset (ii): the x-axis is the size of the chosen subset, and the y-axis is the averaged approximation error.

