LEARNING LARGE-SCALE KERNEL NETWORKS

Abstract

This paper concerns large-scale training of Kernel Networks, a generalization of kernel machines that allows the model to have arbitrary centers. We propose a scalable training algorithm, EigenPro 3.0, based on alternating projections with preconditioned SGD for the alternating steps. This is the first linear-space algorithm for training kernel networks, which enables training models with a large number of centers. In contrast to classical kernel machines, but similar to neural networks, our algorithm decouples the learned model from the training set. This empowers kernel models to take advantage of modern methodologies in deep learning, such as data augmentation. We demonstrate the promise of EigenPro 3.0 in several experiments over large datasets. We also show that data augmentation can improve the performance of kernel models.

1. INTRODUCTION

Kernel Machines are predictive models described by the non-parametric estimation problem

$$\min_{f \in \mathcal{H}} \mathcal{L}(f) := \frac{1}{2}\sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \lambda \|f\|_{\mathcal{H}}^2, \qquad (1)$$

where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), and $(X, y) = \{x_i, y_i\}_{i=1}^{n}$ are training samples. By the representer theorem (Wahba, 1990), the solution to this problem has the form

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) \in \mathcal{H}, \qquad (2)$$

where $K$ is the reproducing kernel corresponding to $\mathcal{H}$. The weights $\alpha = (\alpha_i) \in \mathbb{R}^n$ are chosen to fit $(X, y)$. For example, kernel ridge regression takes the square loss $L(u, v) = (u - v)^2$, and the weights $\alpha \in \mathbb{R}^n$ of the learned model are the unique solution to the $n \times n$ linear system of equations

$$\big(K(X, X) + \lambda I_n\big)\,\alpha = y, \qquad (3)$$

where $[K(X, X)]_{ij} = K(x_i, x_j)$ is the matrix of pairwise kernel evaluations between samples.

However, observe that the kernel machine in equation (2) is strongly coupled to the training set, i.e., predictions from a learned model require access to the entire training dataset. There is no explicit control over the model size; it is always equal to the size $n$ of the dataset. Such a coupling is inconvenient from an engineering perspective, and limits scalability to large datasets for inference as well as for training. For instance, when fresh training samples become available, a larger system of equations needs to be solved, from scratch, to retrain the model.

In contrast, neural networks are decoupled from the training set. In particular, a pretrained neural network can be fine-tuned without any access to the original dataset. This decoupling affords the practitioner tremendous flexibility and is crucial for large-scale learning. Deep learning methodologies take advantage of this scalability. For example, data augmentation is a widely used training technique to boost the performance of neural networks; see Shorten & Khoshgoftaar (2019) for a review.
Here, we augment the training set with artificial samples, which are obtained via perturbations or transformations of the true samples. For kernel machines, data augmentation means increasing the size of the dataset, and hence implicitly also the model size. As a result, data augmentation is prohibitively expensive when learning standard kernel machines.

Kernel networks generalize kernel machines by allowing the flexibility to choose arbitrary centers. Perhaps most importantly, this decouples the learned model from the training set.

Definition 1 (Kernel Network). Given a kernel $K(\cdot, \cdot)$, a set of centers $Z := \{z_i\}_{i=1}^{p}$, and weights $\alpha = (\alpha_i) \in \mathbb{R}^p$, a kernel network is a function $x \mapsto f(x; Z, \alpha)$ given by

$$f(x; Z, \alpha) = \sum_{i=1}^{p} \alpha_i K(x, z_i). \qquad (4)$$

We refer to $p$ as the model size, since the predictor has $p$ degrees of freedom. Note that, by definition, kernel networks do not require access to the training set to make predictions. This helps inference as well as training when $p \ll n$, and enables models to be trained on large-scale datasets. It also provides explicit capacity control through the choice of $p$; such control is lacking in classical kernel machines, where the model size is always $n$.

Kernel networks have classically been studied in machine learning in the form of RBF networks, which correspond to radial kernels $K(x, z) = \phi(\|x - z\|)$. RBF networks were introduced by Broomhead & Lowe (1988) as a function approximation technique. Like neural networks, they are universal approximators for functions in $L^p(\mathbb{R}^d)$; see Park & Sandberg (1993); Poggio & Girosi (1990). Our definition extends to all positive definite kernels. This extension allows using kernels such as the Convolutional Neural Tangent Kernel, which is neither radial nor rotationally invariant, among others.
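To make the decoupling in Definition 1 concrete, here is a minimal sketch (not from the paper) of a kernel network: the model state is only $(Z, \alpha)$, with $p$ freely chosen centers, and prediction via equation (4) never touches the training data. The class name and the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix K(X, Z) (illustrative choice of kernel)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

class KernelNetwork:
    """f(x; Z, alpha) = sum_{i=1}^p alpha_i K(x, z_i) with arbitrary centers Z."""

    def __init__(self, centers):
        self.Z = centers                       # p centers, chosen freely
        self.alpha = np.zeros(len(centers))    # p weights: the model size is p

    def __call__(self, x_new):
        # Prediction touches only (Z, alpha), never the training set.
        return gaussian_kernel(np.atleast_2d(x_new), self.Z) @ self.alpha
```

The centers need not be training points; they could be augmented samples, a subsample, or any other set of $p$ points, which is what gives explicit capacity control.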

1.1. PRIOR WORK

In the case when Z = X, which corresponds to standard kernel machines, there exist several solvers; see Wang et al. (2019).

Regularized ERM: Classical procedures to learn kernel networks in their full generality plug the functional form of equation (4) into problem (1). For example, for the square loss, the solution satisfies

$$\big(K(X, Z)^\top K(X, Z) + \lambda K(Z, Z)\big)\,\alpha = K(X, Z)^\top y, \qquad (5)$$

where $K(X, Z) \in \mathbb{R}^{n \times p}$ is the matrix of pairwise kernel evaluations between data $x_i$ and centers $z_j$. Notice that when $\lambda$ is small, the solution converges to $K(X, Z)^{\dagger} y$, which involves the pseudo-inverse. For other loss functions, iterative methods such as gradient descent can be used to minimize the objective in terms of the weights $\alpha$. Several regularized ERM approaches have been studied; see Que & Belkin (2016) and Scholkopf et al. (1997) for a review and comparisons. These methods suffer from poorly conditioned matrices, which significantly limits their rate of convergence. See Figure 3 in the Appendix for a deeper discussion of this approach and a comparison with problem-specific solvers.

Nyström approximation: Kernel networks with Z ⊂ X have been studied extensively following Williams & Seeger (2000). This is perhaps the predominant strategy for applying kernel machines at scale in the general case, when random features are hard to compute. Methods such as NYTRO (Camoriano et al., 2016) and FALKON (Rudi et al., 2017) are designed to work with such models. These methods require memory quadratic in the model size; for example, Meanti et al. (2020) only train models with 100,000 centers. Scaling these methods to larger model sizes is memory intensive. While these methods were not designed to train kernel networks in their full generality, they perform surprisingly well at this task in high dimensions, since the distribution of the centers often closely resembles the distribution of the data. However, one must exercise caution when using these methods to train general kernel networks, i.e., when Z ⊄ X.
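A direct solve of equation (5) can be sketched in a few lines of NumPy (not the paper's algorithm; the paper's point is precisely that this approach scales poorly, since forming $K(X,Z)^\top K(X,Z)$ squares the condition number and costs $O(np)$ memory). The function names and the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix K(X, Z) (illustrative choice of kernel)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def fit_kernel_network(X, y, Z, lam=1e-3, bandwidth=1.0):
    """Solve (K(X,Z)^T K(X,Z) + lam * K(Z,Z)) alpha = K(X,Z)^T y directly.

    Note: K(X,Z)^T K(X,Z) has the squared condition number of K(X,Z),
    which is the conditioning issue discussed in the text.
    """
    Kxz = gaussian_kernel(X, Z, bandwidth)
    Kzz = gaussian_kernel(Z, Z, bandwidth)
    A = Kxz.T @ Kxz + lam * Kzz
    return np.linalg.solve(A, Kxz.T @ y)
```

For small $p$ this is fine, but the $n \times p$ matrix $K(X, Z)$ must be materialized, which is what linear-space methods avoid.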
Random features model: Decoupled models for kernel machines have been considered earlier, perhaps most elegantly in the random features framework of Rahimi & Recht (2007). However, it is not straightforward to find the correct distribution that yields a desired target kernel, since sampling from the Fourier measure is not always tractable in high dimensions, especially for kernels that are not rotation invariant.

Gaussian Process: In the literature on GPs, the sparse GPs of Titsias (2009) are similar to the kernel networks considered above. These models have so-called inducing points that reduce the model complexity. While several follow-ups, such as Wilson & Nickisch (2015) and Gardner et al. (2018b), have been applied in practice, they require memory quadratic in the number of inducing points, preventing scaling to large models. Indeed, the inducing-points interpretation is perhaps the most useful for choosing 'good' centers for kernel networks.
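For the Gaussian kernel, where the Fourier measure is tractable, the Rahimi & Recht (2007) construction is easy to state: sample frequencies from the kernel's spectral density and use random cosine features whose inner products approximate the kernel. The sketch below (an illustration, not part of the paper) shows this tractable case; for non-rotation-invariant kernels such a sampler is generally unavailable, which is the limitation noted above.

```python
import numpy as np

def random_fourier_features(X, num_features=1000, bandwidth=1.0, seed=0):
    """phi(x) with phi(x) . phi(z) ~= exp(-||x - z||^2 / (2 * bandwidth^2)).

    Frequencies W are drawn from the Fourier (spectral) measure of the
    Gaussian kernel, which is itself Gaussian; phases b are uniform.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```

The model then becomes linear in the features, fully decoupled from the training set, at the price of an approximation error that shrinks only as the number of features grows.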



Further solvers for the Z = X case include Gardner et al. (2018a) and van der Wilk et al. (2020). For certain special kernels, Si et al. (2014) enable speed-ups depending on the scale hyperparameter.

