APPROXIMATING ANY FUNCTION VIA CORESET FOR RADIAL BASIS FUNCTIONS: TOWARDS PROVABLE DATA SUBSET SELECTION FOR EFFICIENT NEURAL NETWORKS TRAINING

Abstract

Radial basis function neural networks (RBFNNs) are well-known for their capability to approximate any continuous function on a closed bounded set with arbitrary precision given enough hidden neurons. A coreset is a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, etc.). In this paper, we suggest the first coreset construction algorithm for RBFNNs, i.e., a small weighted subset which approximates the loss of the input data on any radial basis function network, and thus approximates any function defined by an RBFNN on the big input data. This is done by constructing coresets for radial basis and Laplacian loss functions. We use our coreset to suggest a provable data subset selection algorithm for training deep neural networks: since our coreset approximates every function, it should approximate the gradient of each weight in a neural network, as each gradient is defined as a function on the input. Experimental results on function approximation and data subset selection on popular network architectures and data sets are presented, demonstrating the efficacy and accuracy of our coreset construction.

1. INTRODUCTION

Radial basis function neural networks (RBFNNs) are artificial neural networks that generally have three layers: an input layer, a hidden layer with a radial basis function (RBF) as an activation function, and a linear output layer. In this paper, the input layer receives a d-dimensional vector x ∈ R^d of real numbers. The hidden layer then consists of nodes representing RBFs that compute ρ(∥x − c_i∥₂) := exp(−∥x − c_i∥₂²), where c_i ∈ R^d is the center vector of neuron i, for each of the N neurons in the hidden layer. The linear output layer then computes Σ_{i=1}^{N} α_i ρ(∥x − c_i∥₂), where α_i is the weight of neuron i in the linear output neuron. Therefore, RBFNNs are feed-forward neural networks, since the edges between the nodes do not form a cycle, and they enjoy advantages such as simplicity of analysis, faster training time, and interpretability, compared to alternatives such as convolutional neural networks (CNNs) and even multi-layer perceptrons (MLPs) (Padmavati, 2011).

Function approximation via RBFNNs. RBFNNs are universal approximators in the sense that an RBFNN with a sufficient number of hidden neurons (large N) can approximate any continuous function on a closed, bounded subset of R^d with arbitrary precision (Park & Sandberg, 1991), i.e., given a sufficiently large input set P of n points in R^d and its corresponding label function y : P → R, an RBFNN can be trained to approximate the function y. Therefore, RBFNNs are commonly used across a wide range of applications, such as function approximation (Park & Sandberg, 1991; 1993; Lu et al., 1997), time series prediction (Whitehead & Choate, 1996; Leung et al., 2001; Harpham & Dawson, 2006), classification (Leonard & Kramer, 1991; Wuxing et al., 2004; Babu & Suresh, 2012), and system control (Yu et al., 2011; Liu, 2013), due to their faster learning speed.
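As a concrete illustration of the forward pass described above, the following is a minimal NumPy sketch (the function name `rbfnn_forward` and the toy centers, weights, and input are ours, chosen for illustration, not taken from the paper):

```python
import numpy as np

def rbfnn_forward(x, centers, alphas):
    """Output of a Gaussian RBF network:
    sum_{i=1}^{N} alpha_i * exp(-||x - c_i||_2^2)."""
    # Squared Euclidean distance from x to every center c_i.
    sq_dists = np.sum((centers - x) ** 2, axis=1)
    # Gaussian RBF activation of each hidden neuron.
    activations = np.exp(-sq_dists)
    # Linear output layer: weighted sum of the activations.
    return float(alphas @ activations)

# Tiny example: d = 2 input dimensions, N = 3 hidden neurons.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # c_1..c_3
alphas = np.array([1.0, -0.5, 0.25])                      # output weights
x = np.array([0.5, 0.5])
y = rbfnn_forward(x, centers, alphas)
```

Here all three centers are at squared distance 0.5 from x, so the output is (1.0 − 0.5 + 0.25)·exp(−0.5).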
For a given size of RBFNN (number of neurons in the hidden layer) and an input set, the aim of this paper is to compute a small weighted subset that approximates the loss of the input data on any radial basis function neural network of this size, and thus approximates any function defined (approximated) by such an RBFNN on the big input data. This small weighted subset is called a coreset.

Coresets. Usually, in machine/deep learning, we are given an input set P ⊆ R^d of n points, a corresponding weight function w : P → R, a set of queries X (a set of candidate solutions for the optimization problem involved), and a loss function f : P × X → [0, ∞). The tuple (P, w, X, f) is called a query space, and it defines the optimization problem at hand, where usually the goal is to find x* ∈ argmin_{x ∈ X} Σ_{p ∈ P} w(p)f(p, x). Given a query space (P, w, X, f), a coreset for it is a small weighted subset of the input P that can provably approximate the cost of every query x ∈ X on P (Feldman, 2020); see Definition 1. In particular, a coreset for an RBFNN can approximate the cost of an RBFNN on the original training data for every set of centers and weights that defines the RBFNN (see Section 4). Hence, the coreset also approximates the centers and weights that form the optimal solution of the RBFNN (the solution that approximates the desired function). Thus, a coreset for an RBFNN would enable training for function approximation without reading the full training data, and more generally, a strong coreset for an RBFNN with enough hidden neurons would yield a strong coreset for any function that can be approximated to some precision by that RBFNN. To this end, in this paper we aim to provide a coreset for RBFNNs, and thus to provably approximate (provide a coreset for) any function that can be approximated by a given RBFNN.
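The coreset definition above can be checked empirically: a weighted subset (C, u) should reproduce the weighted cost of the full set P for every query x. The sketch below uses plain uniform sampling with reweighting by n/m merely to illustrate the definition; it is NOT the paper's sensitivity-based construction and carries none of its guarantees. The loss here is the Gaussian/radial loss f(p, x) = exp(−∥p − x∥₂²) from the RBFNN setting:

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(P, w, x):
    """Weighted cost sum_p w(p) * f(p, x) for the radial loss
    f(p, x) = exp(-||p - x||_2^2)."""
    return float(w @ np.exp(-np.sum((P - x) ** 2, axis=1)))

# Input set P of n points in R^d with unit weights.
n, d = 10_000, 2
P = rng.normal(size=(n, d))
w = np.ones(n)

# Illustrative subset: sample m points uniformly without replacement
# and reweight by n/m so the subset cost estimates the full cost.
m = 500
idx = rng.choice(n, size=m, replace=False)
C, u = P[idx], np.full(m, n / m)

# Compare full and subset costs on a few random queries x.
errs = []
for _ in range(5):
    x = rng.normal(size=d)
    full, approx = cost(P, w, x), cost(C, u, x)
    errs.append(abs(full - approx) / full)
```

A true (strong) coreset would bound this relative error by a prescribed ε simultaneously for *all* queries, which uniform sampling alone cannot guarantee.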
Furthermore, we can use this small weighted subset (coreset) to suggest a provable data subset selection algorithm for training deep neural networks efficiently (on the small subset): since our coreset approximates every function that can be approximated by an RBFNN of this size, it should approximate the gradient of each weight in a neural network (whenever that gradient can be approximated by the RBFNN).

Training neural networks on data subsets. Although deep learning has become widely successful with the increasing availability of data (Krizhevsky et al., 2017; Devlin et al., 2019), modern deep learning systems have correspondingly increased in their computational demands, resulting in significantly larger training times, financial costs (Sharir et al., 2020), energy costs (Strubell et al., 2019), and carbon footprints (Strubell et al., 2019; Schwartz et al., 2020). Data subset selection (coresets) allows for efficient learning at several levels (Wei et al., 2014; Kaushal et al., 2019; Coleman et al., 2019; Har-Peled & Mazumdar, 2004; Clarkson, 2010). By employing a significantly smaller subset of the big dataset, (i) we enable learning in relatively low-resource computing settings without requiring a huge number of GPU and CPU servers, (ii) we may greatly reduce the end-to-end turnaround time, which frequently involves many training runs for hyper-parameter tuning, and (iii) because a large number of deep learning trials must be run in practice, we allow for considerable reductions in deep learning energy usage and CO2 emissions (Strubell et al., 2019). Several efforts have recently been made to improve the efficiency of machine learning models using data subset selection (Mirzasoleiman et al., 2020a; Killamsetty et al., 2021b;a).
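To make the gradient-approximation idea concrete, the sketch below shows how a reweighted subset approximates the full-data gradient of a toy least-squares model (the per-point loss, the uniform-sampling subset, and all names are our illustrative assumptions; the paper's coreset would select the subset with provable guarantees instead):

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(X, y, w, theta):
    """Weighted gradient sum_i w_i * 2 * (theta^T x_i - y_i) * x_i
    of the weighted squared loss sum_i w_i * (theta^T x_i - y_i)^2."""
    residual = X @ theta - y
    return 2.0 * (X.T @ (w * residual))

# Toy regression data: n points in R^d with noisy linear labels.
n, d = 20_000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Illustrative subset (uniform sampling, reweighted by n/m).
m = 1_000
idx = rng.choice(n, size=m, replace=False)
w_full = np.ones(n)
w_sub = np.full(m, n / m)

# Gradient at a random model theta: the subset gradient should be
# close to the full-data gradient, so training on the subset mimics
# training on all of the data.
theta = rng.normal(size=d)
g_full = grad(X, y, w_full, theta)
g_sub = grad(X[idx], y[idx], w_sub, theta)
rel_err = np.linalg.norm(g_full - g_sub) / np.linalg.norm(g_full)
```

Training on the subset then simply runs the usual optimizer on the weighted loss over (X[idx], y[idx], w_sub) instead of the full data.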
However, existing techniques either (i) employ proxy functions to choose data points, (ii) are specialized to specific machine learning models, (iii) use approximations of parameters such as gradient error or generalization errors, (iv) lack provable guarantees on the approximation error, or (v) require an inefficient gradient computation of the whole data. Most importantly, all of these methods are model/network



Figure 1: Our contribution in a nutshell.

