LEARNING LARGE-SCALE KERNEL NETWORKS

Abstract

This paper concerns large-scale training of Kernel Networks, a generalization of kernel machines that allows the model to have arbitrary centers. We propose a scalable training algorithm, EigenPro 3.0, based on alternating projections with preconditioned SGD for the alternating steps. This is the first linear-space algorithm for training kernel networks, which enables training models with a large number of centers. In contrast to classical kernel machines, but similar to neural networks, our algorithm decouples the learned model from the training set. This empowers kernel models to take advantage of modern methodologies in deep learning, such as data augmentation. We demonstrate the promise of EigenPro 3.0 in several experiments on large datasets. We also show that data augmentation can improve the performance of kernel models.

1. INTRODUCTION

Kernel machines are predictive models described by the non-parametric estimation problem

$$\min_{f \in \mathcal{H}} \mathcal{L}(f) := \frac{1}{2} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2, \qquad (1)$$

where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), and $(X, y) = \{x_i, y_i\}_{i=1}^{n}$ are the training samples. By the representer theorem Wahba (1990), the solution to this problem has the form

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) \in \mathcal{H}, \qquad (2)$$

where $K$ is the reproducing kernel corresponding to $\mathcal{H}$. The weights $\alpha = (\alpha_i) \in \mathbb{R}^n$ are chosen to fit $(X, y)$. For example, kernel ridge regression takes the square loss $L(u, v) = (u - v)^2$, and the weights $\alpha \in \mathbb{R}^n$ of the learned model are the unique solution to the $n \times n$ linear system of equations

$$(K(X, X) + \lambda I_n)\alpha = y, \qquad (3)$$

where $[K(X, X)]_{ij} = K(x_i, x_j)$ is the matrix of pairwise kernel evaluations between training samples.

Observe, however, that the kernel machine from equation (2) is strongly coupled to the training set: predictions from a learned model require access to the entire training dataset. There is no explicit control over the model size; it always equals the dataset size $n$. Such coupling is inconvenient from an engineering perspective and limits scalability to large datasets, for inference as well as for training. For instance, when fresh training samples become available, a larger system of equations must be solved, from scratch, to retrain the model.

In contrast, neural networks are decoupled from the training set. In particular, a pretrained neural network can be fine-tuned without any access to the original dataset. This decoupling affords the practitioner tremendous flexibility and is crucial for large-scale learning. Deep learning methodologies take advantage of this flexibility. For example, data augmentation is a widely used training technique to boost the performance of neural networks; see Shorten & Khoshgoftaar (2019) for a review.
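To make the coupling concrete, the classical kernel machine of equations (2) and (3) can be sketched in a few lines of NumPy. The Gaussian kernel, the bandwidth, the regularization value, and the synthetic data below are illustrative choices, not specified by this paper; note that both the $n \times n$ solve and prediction require the full training set.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Pairwise Gaussian (RBF) kernel matrix: entry (i, j) is K(x_i, z_j)."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_ridge_fit(X, y, lam):
    """Solve the n x n system (K(X, X) + lam * I_n) alpha = y, as in equation (3)."""
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new):
    """f(x) = sum_i alpha_i K(x, x_i): prediction touches every training sample."""
    return gaussian_kernel(X_new, X_train) @ alpha

# Illustrative synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0])
alpha = kernel_ridge_fit(X, y, lam=1e-3)
preds = kernel_ridge_predict(X, alpha, X)
```

Both the memory for $K(X, X)$ and the model size grow with $n$, which is exactly the scalability limitation discussed above.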
Here, the training set is augmented with artificial samples obtained via perturbations or transformations of the true samples. For kernel machines, data augmentation means increasing the size of the dataset, and hence, implicitly, also the model size. Data augmentation is therefore prohibitively expensive for learning standard kernel machines. Kernel networks generalize kernel machines by allowing the flexibility to choose arbitrary centers. Perhaps most importantly, this decouples the learned model from the training set.

