LEARNING LARGE-SCALE KERNEL NETWORKS

Abstract

This paper concerns large-scale training of Kernel Networks, a generalization of kernel machines that allows the model to have arbitrary centers. We propose a scalable training algorithm, EigenPro 3.0, based on alternating projections with preconditioned SGD for the alternating steps. This is the first linear-space algorithm for training kernel networks, which enables training models with a large number of centers. In contrast to classical kernel machines, but similar to neural networks, our algorithm decouples the learned model from the training set. This empowers kernel models to take advantage of modern methodologies in deep learning, such as data augmentation. We demonstrate the promise of EigenPro 3.0 in several experiments over large datasets. We also show that data augmentation can improve the performance of kernel models.

1. INTRODUCTION

Kernel machines are predictive models described by the non-parametric estimation problem

\[ \min_{f \in \mathcal{H}} L(f) := \frac{1}{2} \sum_{i=1}^n \ell\big(y_i, f(x_i)\big) + \lambda \|f\|_{\mathcal{H}}^2, \tag{1} \]

where \(\mathcal{H}\) is a reproducing kernel Hilbert space (RKHS), and (X, y) = {x_i, y_i}_{i=1}^n are training samples. By the representer theorem, Wahba (1990), the solution to this problem has the form

\[ f(x) = \sum_{i=1}^n \alpha_i K(x, x_i) \in \mathcal{H}, \tag{2} \]

where K is the reproducing kernel corresponding to \(\mathcal{H}\). The weights α = (α_i) ∈ R^n are chosen to fit (X, y). For example, kernel ridge regression takes the square loss ℓ(u, v) = (u − v)², and the weights α ∈ R^n of the learned model are the unique solution to the n × n linear system of equations

\[ \big(K(X, X) + \lambda I_n\big)\alpha = y, \tag{3} \]

where [K(X, X)]_{ij} = K(x_i, x_j) is the matrix of pairwise kernel evaluations between samples.

However, observe that the kernel machine in equation (2) is strongly coupled to the training set, i.e., predictions from a learned model require access to the entire training dataset. There is no explicit control on the model size; it is always the size n of the dataset. Such a coupling is inconvenient from an engineering perspective, and limits scalability to large datasets for inference as well as for training. For instance, when fresh training samples become available, a larger system of equations needs to be solved, from scratch, to retrain the model. In contrast, neural networks are decoupled from the training set. In particular, a pretrained neural network can be fine-tuned without any access to the original dataset. This decoupling affords the practitioner tremendous flexibility and is crucial for large-scale learning. Deep learning methodologies take advantage of this scalability. For example, data augmentation is a widely used training technique to boost the performance of neural networks; see Shorten & Khoshgoftaar (2019) for a review.
Here, we augment the training set with artificial samples, obtained via perturbations or transformations of the true samples. For kernel machines, data augmentation means increasing the size of the dataset, and hence implicitly also the model size. Data augmentation is therefore prohibitively expensive for learning standard kernel machines. Kernel networks generalize kernel machines by allowing the flexibility to choose arbitrary centers. Perhaps most importantly, this decouples the learned model from the training set.

Definition 1 (Kernel network). Given a kernel K(·, ·), a set of centers Z := {z_i}_{i=1}^p, and weights α = (α_i) ∈ R^p, a kernel network is a function x ↦ f(x; Z, α) given by

\[ f(x; Z, \alpha) = \sum_{i=1}^p \alpha_i K(x, z_i). \tag{4} \]

We refer to p as the model size, since the predictor has p degrees of freedom. Note that by definition, kernel networks do not require access to the training set to make predictions. This helps inference as well as training when p ≪ n, and enables models to be trained on large-scale datasets. It also provides explicit capacity control through the choice of p. Such control is lacking in classical kernel machines, since the model size is always n. Kernel networks have classically been studied in machine learning in the form of RBF networks, which correspond to radial kernels K(x, z) = φ(‖x − z‖). RBF networks were introduced by Broomhead & Lowe (1988) as a function approximation technique. Like neural networks, they are universal approximators for functions in L^p(R^d); see Park & Sandberg (1993); Poggio & Girosi (1990). Our definition extends to all positive definite kernels. This extension allows using kernels such as the Convolutional Neural Tangent Kernel, which is neither radial nor rotationally invariant, among others.
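As a concrete illustration of Definition 1, here is a minimal NumPy sketch of kernel network prediction, assuming a Laplacian kernel (the kernel used later in the experiments); the function names are ours for illustration, not from the paper's code:

```python
import numpy as np

def laplacian_kernel(A, B, bandwidth=1.0):
    """Pairwise Laplacian kernel K(a, b) = exp(-||a - b|| / bandwidth)."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / bandwidth)

def kernel_network_predict(X_eval, Z, alpha, kernel=laplacian_kernel):
    """Evaluate f(x; Z, alpha) = sum_i alpha_i K(x, z_i) at each row of X_eval."""
    return kernel(X_eval, Z) @ alpha
```

Note that prediction touches only the p centers Z and the weights α; the training set never appears.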

1.1. PRIOR WORK

In the case Z = X, which corresponds to standard kernel machines, there exist several solvers: Wang et al. (2019); Gardner et al. (2018a); van der Wilk et al. (2020). For certain special kernels, Si et al. (2014) enable speed-ups depending on the scale hyperparameter. The classical procedure to learn kernel networks in their full generality is to plug the functional form of equation (4) into problem (1). For example, for the square loss, the solution satisfies

\[ \big(K(X, Z)^\top K(X, Z) + \lambda K(Z, Z)\big)\alpha = K(X, Z)^\top y, \tag{5} \]

where K(X, Z) ∈ R^{n×p} is the matrix of pairwise kernel evaluations between data x_i and centers z_j. Notice that as λ → 0, the solution converges to K(X, Z)† y, which involves the pseudo-inverse. For other loss functions, iterative methods such as gradient descent can be used to minimize the objective in terms of the weights α. Several regularized ERM approaches have been studied; see Que & Belkin (2016) and Schölkopf et al. (1997) for a review and comparisons. These methods suffer from poorly conditioned matrices, which significantly limits their rate of convergence. See Figure 3 in the Appendix for a deeper discussion of this approach and a comparison with problem-specific solvers.

Nyström approximation: Kernel networks with Z ⊂ X have been studied extensively following Williams & Seeger (2000). This is perhaps the predominant strategy for applying kernel machines at scale, in the general case when random features are hard to compute. Methods such as NYTRO, Camoriano et al. (2016), and FALKON, Rudi et al. (2017), are designed to work with such models. These methods require memory quadratic in the model size. For example, Meanti et al. (2020) only train models with 100,000 centers. Scaling these methods to larger model sizes is memory intensive.
While these methods were not designed to train kernel networks in their full generality, they perform surprisingly well for this task in high dimensions, since the distribution of the centers often closely resembles the distribution of the data. However, one must exercise caution when training general kernel networks with these methods, i.e., when Z ⊄ X.

Random features model: Decoupled models for kernel machines have been considered earlier, perhaps most elegantly in the random features framework of Rahimi & Recht (2007). However, it is not straightforward to find the correct distribution that yields a desired target kernel, since sampling from the Fourier measure is not always tractable in high dimensions, especially for kernels that are not rotation invariant.

Gaussian processes: In the literature on GPs, the sparse GPs of Titsias (2009) are similar to the kernel networks considered above. These models have so-called inducing points that reduce the model complexity. While several follow-ups, such as Wilson & Nickisch (2015) and Gardner et al. (2018b), have been applied in practice, they require memory quadratic in the number of inducing points, preventing scaling to large models. Indeed, the inducing-points interpretation is perhaps the most useful in choosing 'good' centers for kernel networks.

EigenPro (short for Eigenspace Projections): an iterative algorithm for kernel regression, i.e., when Z = X. It solves the linear system (3) by taking advantage of the problem structure. The algorithm applies a preconditioned Richardson iteration, Richardson (1911), based on projecting gradients onto certain eigenspaces of K(X, X). EigenPro 2.0, Ma & Belkin (2019), improved upon EigenPro 1.0, Ma & Belkin (2017), by reducing the computational and memory costs of the preconditioner, via a stochastic approximation for estimating, and projecting onto, the relevant eigenspaces. However, EigenPro 2.0 cannot solve the general problem of learning a kernel network, i.e., when Z ≠ X. Our extension, EigenPro 3.0, proposed in this paper, fills this gap.
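For reference, the classical plug-in approach of equation (5) can be sketched in a few lines of NumPy, assuming precomputed kernel matrices. This is the direct method whose poor conditioning and O(p³) cost motivate the iterative algorithm of this paper:

```python
import numpy as np

def fit_kernel_network_direct(K_XZ, K_ZZ, y, lam):
    """Classical plug-in solve of equation (5):
        (K(X,Z)^T K(X,Z) + lam K(Z,Z)) alpha = K(X,Z)^T y.
    Costs O(n p^2 + p^3) time and O(n p) memory, and the p x p system
    is often poorly conditioned -- fine for small p, unscalable beyond."""
    A = K_XZ.T @ K_XZ + lam * K_ZZ
    b = K_XZ.T @ y
    return np.linalg.solve(A, b)
```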

1.2. MAIN CONTRIBUTIONS

We develop a scalable iterative algorithm for learning kernel networks with a low memory footprint. Our training algorithm is derived using alternating, but separate, eigenspace projection steps. Importantly, our preconditioning preserves the decoupling between the model and the training set. EigenPro serves as the basis for our approach, and we use the same form of preconditioner for the more general problem of learning kernel networks. Our algorithm requires an additional projection step necessary for this problem, which can be solved by a decoupled instance of EigenPro. The focus of this paper is the design of an algorithm for training kernel networks in full generality with linear space complexity in the model size. We omit generalization and optimization properties of the algorithm, and will consider them in follow-up works. As such, these methods are expected to converge and behave well, with analyses from linear systems and convex quadratic optimization being directly applicable. Furthermore, our method converges to a consistent estimator if we consider a student-teacher setup with known model centers. Some noteworthy highlights of our training algorithm, EigenPro 3.0, are:

1. Model decoupled from training data: Our algorithm fully respects the decoupling between the model and the training data, and allows for any configuration of model centers. In particular, we do not require any label information at the centers. The flexibility of a decoupled model allows us to apply data augmentation when learning kernel networks; we demonstrate a gain in performance with this approach. We can also train overparameterized kernel networks with p > n.

2. Linear space complexity: Our algorithm runs with O(p) memory and O(p²) computations per iteration. Consequently, we can handle very large model sizes, with the potential to scale much further. For example, in our numerical experiments we trained models with 512,000 degrees of freedom. To our knowledge, this is the first general kernel network of this size trained with ≤ 100 GB RAM.

Organization: In Section 3, we derive a vanilla version of our algorithm as a function-space projected preconditioned gradient descent, with a decoupled, model-agnostic preconditioner. In Section 4, we introduce several stochastic approximations that make our algorithm much faster and scalable to large-scale datasets. Section 5 demonstrates the scalability of our algorithm to large datasets and large models over several datasets. Proofs are relegated to the Appendix.

2. PRELIMINARIES AND NOTATION

In what follows, functions are lowercase letters a, sets are uppercase letters A, vectors are lowercase bold letters a, matrices are uppercase bold letters A, operators are calligraphic letters 𝒜, and spaces and subspaces are boldface calligraphic letters. Subscripts to sets, vectors, and matrices indicate size. If K is a reproducing kernel for an RKHS \(\mathcal{H}\), then we have

\[ \langle a, K(\cdot, x) \rangle_{\mathcal{H}} = a(x) \;\; \forall\, a \in \mathcal{H}, \qquad \langle K(\cdot, x), K(\cdot, z) \rangle_{\mathcal{H}} = K(x, z) = K(z, x). \tag{6} \]

Evaluations and kernel matrices: The vector of evaluations of a function f over a set X = {x_i}_{i=1}^n is f(X) := (f(x_i)) ∈ R^n. We denote the kernel matrices K(X, Z) ∈ R^{n×p}, K(X, X) ∈ R^{n×n}, K(Z, Z) ∈ R^{p×p}, and K(Z, X) = K(X, Z)^⊤. Similarly, K(·, X) ∈ H^{1×n} and K(·, Z) ∈ H^{1×p}, and for a set A = {a_i}_{i=1}^k and a vector α = (α_i) ∈ R^k, we use the notation

\[ K(\cdot, A)\alpha := \sum_{i=1}^k K(\cdot, a_i)\alpha_i \in \mathcal{H}, \qquad K(z, A)\alpha := \sum_{i=1}^k K(z, a_i)\alpha_i \in \mathbb{R}. \]

Finally, for an operator 𝒜, a function a, and a set A = {a_i}_{i=1}^k, we write 𝒜{a}(A) := (b(a_i)) ∈ R^k, where b = 𝒜{a}.

Definition 2 (Top-q eigensystem). Let λ_1 > λ_2 > ... > λ_n be the eigenvalues of a Hermitian matrix A ∈ R^{n×n}, i.e., for unit-norm e_i we have A e_i = λ_i e_i. Then we call the tuple (Λ_q, E_q, λ_{q+1}) the top-q eigensystem, where Λ_q = diag(λ_1, λ_2, ..., λ_q) ∈ R^{q×q} and E_q = [e_1, e_2, ..., e_q] ∈ R^{n×q}.

Fréchet derivative: Given a functional J : H → R, the Fréchet derivative of J with respect to f is a linear functional, denoted ∇_f J, such that

\[ \lim_{\|h\|_{\mathcal{H}} \to 0} \frac{|J(f + h) - J(f) - \nabla_f J(h)|}{\|h\|_{\mathcal{H}}} = 0. \]

Since ∇_f J is a linear functional, it lies in the dual space H*. Since H is a Hilbert space, it is self-dual, whereby H* = H.
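The top-q eigensystem of Definition 2 can be computed with a dense eigendecomposition; a minimal NumPy sketch (note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so we flip them to match the definition):

```python
import numpy as np

def top_q_eigensystem(A, q):
    """Top-q eigensystem of a symmetric matrix A (Definition 2):
    returns (Lambda_q, E_q, lambda_{q+1}) with eigenvalues descending."""
    vals, vecs = np.linalg.eigh(A)              # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]      # flip to descending
    return np.diag(vals[:q]), vecs[:, :q], vals[q]
```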
If L is the square loss for a given dataset (X, y), i.e., L(f) := ½ Σ_{i=1}^n (f(x_i) − y_i)², we can apply the chain rule and use equation (6), together with the fact that ∇_f ⟨f, g⟩_H = g, to get that the Fréchet derivative at f = f_0 is

\[ \nabla_f L(f_0) = \sum_{i=1}^n \big(f_0(x_i) - y_i\big)\nabla_f f(x_i) = \sum_{i=1}^n \big(f_0(x_i) - y_i\big) K(\cdot, x_i) = K(\cdot, X)\big(f_0(X) - y\big). \tag{10} \]

Hessian operator: The Hessian operator ∇²_f L : H → H for the square loss is given by

\[ \mathcal{K} := \sum_{i=1}^n K(\cdot, x_i) \otimes K(\cdot, x_i), \qquad \mathcal{K}\{f\}(z) := \sum_{i=1}^n K(z, x_i) f(x_i) = K(z, X) f(X). \tag{11} \]

Note that 𝒦 is surjective on 𝒳, and hence invertible when restricted to 𝒳. Note that when the x_i are drawn i.i.d. from some measure P, the above summation, rescaled by 1/n, converges by the strong law of large numbers:

\[ \lim_{n \to \infty} \frac{\mathcal{K}\{f\}}{n} = \mathcal{T}_K\{f\} := \int K(\cdot, x) f(x)\, dP(x), \tag{12} \]

which is an integral operator. The following proposition relates the spectra of 𝒦 and K(X, X).

Proposition 1 (Nyström extension). For 1 ≤ i ≤ n, let λ_i be an eigenvalue of 𝒦, and ψ_i its unit H-norm eigenfunction, i.e., 𝒦{ψ_i} = λ_i ψ_i. Then λ_i is also an eigenvalue of K(X, X). Moreover, if e_i is its unit-norm eigenvector, i.e., K(X, X) e_i = λ_i e_i, we have

\[ \psi_i = K(\cdot, X)\, \frac{e_i}{\sqrt{\lambda_i}}. \]

We now review EigenPro 2.0, a closely related algorithm for kernel regression, i.e., when Z = X.

Background on EigenPro (short for Eigenspace Projections): Proposed in Ma & Belkin (2017), EigenPro 1.0 is an iterative solver for the linear system in equation (3), based on preconditioned stochastic gradient descent in the Hilbert space:

\[ f_{t+1} = f_t - \eta\, \mathcal{P}\{\nabla_f L(f_t)\}. \tag{14} \]

Here 𝒫 is a preconditioner. Due to its iterative nature, EigenPro can handle λ = 0 in equation (3), corresponding to the problem of kernel interpolation, since in that case the learned model satisfies f(x_i) = y_i for all samples in the training set. It can be shown that the following iteration in R^n,

\[ \alpha_{t+1} = \alpha_t - \eta (I_n - Q)\big(K(X, X)\alpha_t - y\big), \]

emulates equation (14) in H; see Lemma 3 in the Appendix.
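The finite-dimensional iteration above can be sketched directly in NumPy. This is a full-batch illustration with a dense eigendecomposition, not the stochastic Nyström approximation that EigenPro 2.0 actually uses:

```python
import numpy as np

def eigenpro_richardson(K, y, q, eta=None, steps=500):
    """Full-batch sketch of the preconditioned iteration
        alpha <- alpha - eta * (I - Q)(K alpha - y),
    with Q built from the top-q eigensystem of K. For clarity only;
    EigenPro 2.0 instead estimates the eigensystem from a subsample."""
    n = K.shape[0]
    vals, vecs = np.linalg.eigh(K)              # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]      # descending order
    E, lam_q1 = vecs[:, :q], vals[q]
    Q = E @ (np.eye(q) - lam_q1 * np.diag(1.0 / vals[:q])) @ E.T
    if eta is None:
        eta = 1.0 / lam_q1    # stable up to 2 / lambda_{q+1} after preconditioning
    alpha = np.zeros(n)
    for _ in range(steps):
        alpha = alpha - eta * (np.eye(n) - Q) @ (K @ alpha - y)
    return alpha
```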
The above iteration is a preconditioned version of the Richardson iteration, Richardson (1911), with well-known convergence properties. Here, Q is a rank-q symmetric matrix obtained from the top-q eigensystem of K(X, X), with q ≪ n. The preconditioner 𝒫 acts to flatten the spectrum of the Hessian 𝒦. In R^n, the matrix I_n − Q has the same effect on K(X, X). The largest stable learning rate is then 2/λ_{q+1} instead of 2/λ_1. Hence a larger q allows faster training when 𝒫 is chosen appropriately. EigenPro 2.0, proposed in Ma & Belkin (2019), applies a stochastic approximation for 𝒫 based on the Nyström extension. We apply EigenPro 2.0 to perform an inexact projection step in our algorithm.

3. EigenPro 3.0: PROJECTED PRECONDITIONED GRADIENT DESCENT

In this section we derive EigenPro 3.0 exact-projection, a precursor to EigenPro 3.0, for learning a kernel network. This algorithm is based on a function-space projected gradient method. However, it does not scale well. In Section 4 we make it scalable by applying stochastic approximations, which finally yields EigenPro 3.0 (Algorithm 1). We want to solve the following constrained infinite-dimensional problem:

\[ \operatorname*{minimize}_{f}\; L(f) = \sum_{i=1}^n \big(f(x_i) - y_i\big)^2, \quad \text{subject to } f \in \mathcal{Z} := \operatorname{span}\{K(\cdot, z_j)\}_{j=1}^p. \tag{16} \]

Thus the learned model f is a linear combination of the functions {K(·, z_j)}_{j=1}^p, just as in Definition 1. We apply the function-space projected gradient method to solve this problem:

\[ f_{t+1} = \operatorname{proj}_{\mathcal{Z}}\big(f_t - \eta\, \mathcal{P}\{\nabla_f L(f_t)\}\big), \quad \text{where } \operatorname{proj}_{\mathcal{Z}}(u) := \operatorname*{argmin}_{f \in \mathcal{Z}} \|u - f\|_{\mathcal{H}}^2, \tag{17} \]

where ∇_f L(f_t) is the Fréchet derivative at f_t as given in equation (10), 𝒫 is the preconditioning operator given in equation (25), η is a learning rate, and proj_Z : H → Z is the operator that projects functions from H onto the subspace Z.

Remark 1.
Note that even though equation (17) is an iteration over functions, which are infinite-dimensional objects {f_t}_{t≥0}, we can represent this iteration in finite dimensions as {α_t}_{t≥0}, where α_t ∈ R^p. To see this, observe that f_t ∈ Z, whereby we can express it as

\[ f_t = K(\cdot, Z)\alpha_t \in \mathcal{H}, \tag{19} \]

for some α_t ∈ R^p. Furthermore, the evaluation of f_t at X is f_t(X) = K(X, Z)α_t ∈ R^n.

Gradient: By equations (10) and (19) together, the gradient is given by the function

\[ \nabla_f L(f_t) = K(\cdot, X)\big(f_t(X) - y\big) = K(\cdot, X)\big(K(X, Z)\alpha_t - y\big) \in \mathcal{X} := \operatorname{span}\{K(\cdot, x_i)\}_{i=1}^n. \tag{20} \]

Observe that the gradient does not lie in Z; hence a step of gradient descent would leave Z and violate the constraint. This necessitates a projection onto Z. For finitely generated subspaces such as Z, the projection operation amounts to solving a finite-dimensional linear system.

H-norm projection: Functions in Z can be expressed as K(·, Z)θ. Hence we can rewrite the projection problem in equation (17) as a minimization in R^p, with θ as the unknowns. Observe that argmin_f ‖f − u‖_H = argmin_f ‖f − u‖²_H = argmin_f ⟨f, f⟩_H − 2⟨f, u⟩_H, since ‖u‖²_H does not affect the solution. Further, using f = K(·, Z)θ, we can show that

\[ \langle f, f \rangle_{\mathcal{H}} - 2\langle f, u \rangle_{\mathcal{H}} = \theta^\top K(Z, Z)\theta - 2\theta^\top u(Z). \tag{21} \]

This yields a simple method to calculate the projection onto Z:

\[ \operatorname{proj}_{\mathcal{Z}}\{u\} = \operatorname*{argmin}_{f \in \mathcal{Z}} \|f - u\|_{\mathcal{H}} = K(\cdot, Z)\theta^\star = K(\cdot, Z)K(Z, Z)^{-1} u(Z) \in \mathcal{Z}, \]

where θ* = argmin_{θ∈R^p} θ^⊤ K(Z, Z)θ − 2θ^⊤ u(Z) = K(Z, Z)^{-1} u(Z). Notice that θ* above is linear in u, and f_t(Z) = K(Z, Z)α_t. Hence we have the following proposition.

Algorithm 1: EigenPro 3.0
Require: Data (X, y), centers Z, batch size m, Nyström size s, preconditioner level q.
1: Fetch subsample X_s ⊆ X of size s
2: (E, Λ) ← top-q eigensystem of K(X_s, X_s)
3: C ← K(Z, X_s) E (Λ⁻¹ − λ_{q+1}Λ⁻²) E^⊤ ∈ R^{p×s}
4: while stopping criterion is not reached do
5:   Fetch minibatch (X_m, y_m)
6:   g_m ← K(X_m, Z)α − y_m
7:   h ← K(Z, X_m) g_m − C K(X_s, X_m) g_m
8:   θ ← EigenPro 2.0(Z, h)    ▷ approximates K(Z, Z)⁻¹ h
9:   α ← α − (n/m) η θ
10: end while

Algorithm 2: EigenPro 3.0 exact-projection
Require: Data (X, y), centers Z, initialization α_0, preconditioning level q.
1: (E, Λ) ← top-q eigensystem of K(X, X)
2: Q ← E (I_q − λ_{q+1}Λ⁻¹) E^⊤ ∈ R^{n×n}
3: while stopping criterion is not reached do
4:   g ← K(X, Z)α − y
5:   h ← K(Z, X)(I_n − Q) g
6:   θ ← K(Z, Z)⁻¹ h
7:   α ← α − η θ
8: end while

See Table 1 for a comparison of costs.

Proposition 2 (Projection). The projection step in equation (17) can be simplified as

\[ f_{t+1} = f_t - \eta\, K(\cdot, Z) K(Z, Z)^{-1}\, \mathcal{P}\{\nabla_f L(f_t)\}(Z) \in \mathcal{Z}. \]

Hence, in order to perform the update, we only need to know 𝒫{∇_f L(f_t)}(Z), which can be evaluated efficiently for a suitably chosen preconditioner.

Data preconditioner agnostic to the model: Just as with usual gradient descent, the largest stable learning rate is governed by the largest eigenvalue of the Hessian of the objective in equation (16), which is given by equation (11). The preconditioner 𝒫 in equation (17) acts to reduce the effect of a few large eigenvalues. We choose the 𝒫 given in equation (25), just like Ma & Belkin (2017):

\[ \mathcal{P} := \mathcal{I} - \sum_{i=1}^{q} \Big(1 - \frac{\lambda_{q+1}}{\lambda_i}\Big)\, \psi_i \otimes \psi_i \;:\; \mathcal{H} \to \mathcal{H}. \tag{25} \]

Recall from Section 2 that the ψ_i are eigenfunctions of the Hessian 𝒦, characterized in Proposition 1. Note that this preconditioner is independent of Z. Since ∇_f L(f_t) ∈ 𝒳, we only need to understand 𝒫 on 𝒳. Let (Λ_q, E_q) be the top-q eigensystem of K(X, X); see Definition 2. Define the rank-q matrix

\[ Q = E_q \big(I_q - \lambda_{q+1} \Lambda_q^{-1}\big) E_q^\top \in \mathbb{R}^{n \times n}. \]

The following proposition outlines the computation involved in preconditioning.

Proposition 3 (Preconditioner).
The action of 𝒫 from equation (25) on functions in 𝒳 is given by

\[ \mathcal{P}\{K(\cdot, X)a\} = K(\cdot, X)(I_n - Q)a, \quad \forall\, a \in \mathbb{R}^n. \]

Since we know from equation (20) that ∇_f L(f_t) = K(·, X)(K(X, Z)α_t − y), we have

\[ \mathcal{P}\{\nabla_f L(f_t)\}(Z) = K(Z, X)(I_n - Q)\big(K(X, Z)\alpha_t - y\big). \]

The following lemma combines this with Proposition 2 to obtain the update equation of Algorithm 2.

Lemma 1 (Algorithm 2 iteration). The following iteration in R^p emulates equation (17) in H:

\[ \alpha_{t+1} = \alpha_t - \eta\, K(Z, Z)^{-1} K(Z, X)(I_n - Q)\big(K(X, Z)\alpha_t - y\big). \]

Algorithm 2 does not scale well to large models and large datasets. We now propose stochastic approximations that make it scalable to both.
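Lemma 1 and Algorithm 2 can be sketched in NumPy as follows. This is an illustration using dense linear algebra, which is exactly what the approximations of Section 4 remove:

```python
import numpy as np

def eigenpro3_exact(K_XZ, K_ZZ, K_XX, y, q, eta, steps, alpha0=None):
    """Sketch of Algorithm 2 (EigenPro 3.0 with exact projection):
        g     <- K(X,Z) alpha - y
        h     <- K(Z,X) (I - Q) g
        theta <- K(Z,Z)^{-1} h     (exact H-norm projection onto Z)
        alpha <- alpha - eta * theta
    Dense eigendecomposition and solve for clarity; Algorithm 1 replaces
    the preconditioner by a Nystrom estimate and the solve by EigenPro 2.0."""
    n, p = K_XZ.shape
    vals, vecs = np.linalg.eigh(K_XX)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    E, lam_q1 = vecs[:, :q], vals[q]
    Q = E @ (np.eye(q) - lam_q1 * np.diag(1.0 / vals[:q])) @ E.T
    P = np.eye(n) - Q
    alpha = np.zeros(p) if alpha0 is None else alpha0.copy()
    for _ in range(steps):
        g = K_XZ @ alpha - y
        theta = np.linalg.solve(K_ZZ, K_XZ.T @ (P @ g))
        alpha = alpha - eta * theta
    return alpha
```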

4. APPROXIMATIONS

Algorithm 2 suffers from three main issues. It requires: (i) access to the entire dataset of size O(n) at each iteration; (ii) O(n²) memory to compute the preconditioner Q; and (iii) O(p³) computations for the matrix inversion corresponding to an exact projection. This prevents scalability to large n and p.

Stochastic gradients:

We can replace the gradient with stochastic gradients, whereby ∇_f L(f_t) depends only on a batch (X_m, y_m) of size m, denoted X_m = {x_{i_j}}_{j=1}^m and y_m = (y_{i_j}) ∈ R^m:

\[ \nabla_f L(f_t) = \sum_{j=1}^m \big(f(x_{i_j}) - y_{i_j}\big) K(\cdot, x_{i_j}) = K(\cdot, X_m)\big(K(X_m, Z)\alpha - y_m\big) \in \mathcal{X}. \tag{31} \]

Remark 2. Here we need to scale the learning rate by n/m to obtain unbiased estimates of ∇_f L(f_t).

Nyström preconditioning: We next obtain an approximation to the preconditioner 𝒫 from equation (25), which requires access to all samples. We use the Nyström extension; see Williams & Seeger (2000). Consider a subset of size s, X_s = {x_{i_k}}_{k=1}^s ⊆ X. We introduce the Nyström preconditioner

\[ \mathcal{P}_s := \mathcal{I} - \sum_{i=1}^{q} \Big(1 - \frac{\lambda^s_{q+1}}{\lambda^s_i}\Big)\, \psi^s_i \otimes \psi^s_i, \]

where the ψ^s_i are eigenfunctions of 𝒦_s := Σ_{k=1}^s K(·, x_{i_k}) ⊗ K(·, x_{i_k}). Note that 𝒦_s ≈ (s/n)𝒦, since both approximate 𝒯_K, as shown in equation (12). However, the scaling does not affect the preconditioner 𝒫_s, since the ψ^s_i are unit norm. This preconditioner was first proposed in Ma & Belkin (2019). Next, we need to understand the action of 𝒫_s on elements of 𝒳. Let (E_q, Λ_q) be the top-q eigensystem of K(X_s, X_s). Define the rank-q matrix

\[ Q_s := E_q \big(I_q - \lambda_{q+1} \Lambda_q^{-1}\big) \Lambda_q^{-1} E_q^\top \in \mathbb{R}^{s \times s}. \]

Lemma 2 (Nyström preconditioning). Let a ∈ R^m, and let X_m be chosen as in equation (31). Then

\[ \mathcal{P}_s\{K(\cdot, X_m)a\} = K(\cdot, X_m)a - K(\cdot, X_s)\, Q_s\, K(X_s, X_m)\, a. \]

Consequently, using equation (31), we get

\[ \mathcal{P}_s\{\nabla_f L(f_t)\}(Z) = \big(K(Z, X_m) - K(Z, X_s) Q_s K(X_s, X_m)\big)\big(K(X_m, Z)\alpha_t - y_m\big) \in \mathbb{R}^p. \]

Inexact projection: The projection step in Algorithm 2 requires the inverse of K(Z, Z), which is computationally expensive. However, this step solves the p × p linear system

\[ K(Z, Z)\theta = \big(K(Z, X_m) - K(Z, X_s) Q_s K(X_s, X_m)\big)\big(K(X_m, Z)\alpha_t - y_m\big). \tag{36} \]

Notice that this is exactly the kernel interpolation problem that EigenPro 2.0 can solve.
This leads to the update

\[ \alpha_{t+1} = \alpha_t - \frac{n}{m}\, \eta\, \theta_T, \qquad \text{(EigenPro 3.0 update)} \]

where θ_T is the solution to equation (36) after T steps of EigenPro 2.0, given in Algorithm 3 in the Appendix. Algorithm 1 implements this update. Furthermore, EigenPro 2.0 itself applies a preconditioner that depends only on Z, with no dependence on X, thus maintaining the decoupling.

Remark 3 (Details on inexact projection using EigenPro 2.0). We apply T steps of EigenPro 2.0 for the approximate projection. This algorithm itself applies a fast preconditioned SGD to solve the problem, and needs no hyperparameter adjustment. More details are in the Appendix.

Complexity analysis: We compare the order complexity of the run time and memory requirements of Algorithm 1, before and after the stochastic approximations, with the FALKON solver in Table 1.
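The Nyström-preconditioned gradient of Lemma 2, evaluated at the centers, can be sketched as follows. This is an illustration, not the paper's released code; one useful sanity check is that with X_s = X_m = X it reproduces the exact preconditioner of Proposition 3:

```python
import numpy as np

def nystrom_precond_grad(K_ZXm, K_ZXs, K_XsXs, K_XsXm, g, q):
    """Action of the Nystrom preconditioner on the batch gradient,
    evaluated at the centers Z (Lemma 2):
        P_s{grad}(Z) = K(Z,X_m) g - K(Z,X_s) Q_s K(X_s,X_m) g,
    with Q_s = E_q (I_q - lam_{q+1} Lam_q^{-1}) Lam_q^{-1} E_q^T built
    from the top-q eigensystem of K(X_s, X_s)."""
    vals, vecs = np.linalg.eigh(K_XsXs)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    E, lam = vecs[:, :q], vals[:q]
    Qs = E @ np.diag((1.0 - vals[q] / lam) / lam) @ E.T
    return K_ZXm @ g - K_ZXs @ (Qs @ (K_XsXm @ g))
```

Note that only kernel blocks against the subsample X_s and the batch X_m are needed, never the full n × n matrix, which is the source of the O(p) memory footprint.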

5. NUMERICAL EXPERIMENTS

We perform experiments on the following datasets: CIFAR10, Krizhevsky et al. (2009), MNIST, FashionMNIST, and CIFAR5M.

Large scale training:

In this experiment, we demonstrate that we can learn models with 256,000 centers and 2,000,000 training samples. We also see that increasing the number of training samples for a fixed model improves its performance. As a baseline, we compare with a standard kernel machine of the same model size p (i.e., a kernel machine trained on the p centers). We apply Algorithm 1 to train a model whose p centers are a random subset of a dataset with n samples, and systematically study the performance as we vary both n and p. Figure 1 shows that, for a fixed model size, adding more data improves performance significantly.

Data augmentation: Data augmentation is an important tool for enhancing the performance of deep networks. We demonstrate how it can also improve the performance of kernel models. We conduct our experiment on CIFAR10 with feature extraction, and on raw images of MNIST and FashionMNIST. For CIFAR10 augmentation, we performed random cropping and flipping before feature extraction. Also for CIFAR10, we tried the mix-up augmentation method of Zhang et al. (2017) after feature extraction. For MNIST and FashionMNIST augmentation, we added Gaussian noise with different variances. In all cases, we used the entire training set as the centers and generated the augmented dataset from the same training set. We performed 100 epochs for all of them, which makes each dataset effectively ≈ 100× larger. Figure 2 shows that we obtain significant improvements in accuracy. We show results for the Laplacian kernel with a fixed bandwidth; we did not tune for the optimal bandwidth.

Flexible choice of model centers: In our model, the centers z_i need not be a subset of the training samples. More importantly, the model is agnostic to labels at these points, in contrast to FALKON, Rudi et al. (2017). In this experiment we show an example choice of centers that yields better performance than sub-sampling the dataset.
We choose the centers as the centroids of a k-means clustering procedure with k = p, using Omer (2020).

Comparison to other works: We demonstrate, in Table 3, that existing methods for center-based kernel regression on large datasets fail when the number of centers is large. We compared our method with FALKON, Rudi et al. (2017), and GPyTorch, Gardner et al. (2018a). We used a fixed 100 GB of RAM for all methods. For GPyTorch, we tried to reproduce the version Rudi et al. (2017) used in their experiments, stochastic variational GPs (SVGP). We noticed they used SVGP with a very small number of centers (≈ 2000). We could not run GPyTorch with a large number of centers. Also, note that FALKON originally did not have any notion of centers; they only used sub-sampling for more efficient computation. However, we found that their code can also be used when the centers are not a subset of the original data.
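The k-means choice of centers can be sketched with a minimal Lloyd's algorithm. The paper uses an off-the-shelf implementation, Omer (2020); the deterministic farthest-point initialization here is our own simplification:

```python
import numpy as np

def kmeans_centers(X, p, iters=50):
    """Pick p model centers as k-means centroids (Lloyd's algorithm),
    with a deterministic farthest-point initialization."""
    Z = [X[0]]
    for _ in range(p - 1):                       # farthest-point initialization
        d = np.min([((X - z) ** 2).sum(1) for z in Z], axis=0)
        Z.append(X[d.argmax()])
    Z = np.array(Z)
    for _ in range(iters):                       # Lloyd iterations
        d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # n x p distances
        labels = d.argmin(1)
        for j in range(p):
            members = X[labels == j]
            if len(members):
                Z[j] = members.mean(0)
    return Z
```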



Here we store the full K(Z, Z) matrix on the GPU. This is not possible for a larger number of centers.



Figure 2 (Data augmentation): We used the entire datasets as the model centers and trained the model on the augmented set. For CIFAR10, we apply MixUp and Crop+Flip augmentations, whereas for MNIST and FashionMNIST we add white Gaussian noise with different variances. Performance improves by ∼1.5%, which is comparable to the gains seen by neural networks. The model without augmentation uses EigenPro 2.0 to solve the kernel regression problem on the dataset.

APPENDIX

In this section we present the three stochastic approximation schemes, stochastic gradients, Nyström preconditioning, and inexact projection, that drastically reduce the computational cost and memory requirements. Together, these approximations give us Algorithm 1. Algorithm 2 emulates equation (17), whereas Algorithm 1 is designed to emulate its approximation, where ∇_f L(f_t) is a stochastic gradient obtained from a subsample of size m, 𝒫_s is a preconditioner obtained via a Nyström extension from a subset of size s, and proj_Z is an inexact projection using EigenPro 2.0 to solve the projection equation K(Z, Z)θ = h.

Table 3: We performed this experiment on extracted features of the CIFAR5M dataset. It shows that both GPyTorch and FALKON fail for a large number of centers. Also, note that we are not using any tricks to optimize our algorithm for best time performance. However, if we could store the whole K(Z, Z) matrix (third column), we would get a significant speed-up.

6. CONCLUSION

Kernel networks, unlike kernel machines, are a class of kernel models decoupled from the training dataset. In this paper we presented a fast and scalable training algorithm, EigenPro 3.0, for learning general kernel networks on large-scale datasets, in a manner that preserves the decoupling property of the learned model and does not require label information at the model centers. The method relies on alternating projection operations with preconditioning, one dependent only on the training data, and the other only on the model centers. We proposed stochastic approximations that make our algorithm scalable to large datasets as well as large model sizes. The algorithm has linear space complexity in the number of model centers and does not require any matrix inversion. Through numerical experiments, we provided evidence of the promise of this algorithm on several datasets across problem domains. In particular, we showed that kernel models can benefit from data augmentation without increasing the model complexity. Our algorithm enables various modern machine learning techniques for training kernel methods. The next step is to scale up the algorithm to train large models with millions of centers and billions of samples.

