KERNEL REGRESSION WITH INFINITE-WIDTH NEURAL NETWORKS ON MILLIONS OF EXAMPLES

Abstract

While kernel regression remains an important practical method, its connection to neural networks in the large-width limit has spurred new research. These neural kernels have substantially improved performance on diverse and nonstandard data modalities but require significantly more compute, which previously limited their application to smaller datasets. We address this by massively parallelizing their computation across many GPUs. We combine this with a distributed, preconditioned conjugate gradients algorithm to enable kernel regression at a large scale (up to five million examples). Using this approach, we study scaling laws of several neural kernels across many orders of magnitude on the CIFAR-5m dataset. Using data augmentation to expand the original CIFAR-10 training dataset by a factor of 20, we obtain a test accuracy of 91.2% (SotA for a pure kernel method). Finally, we explore other data modalities, obtaining results on protein and small-molecule prediction tasks that are competitive with SotA methods.

1. INTRODUCTION

Kernel methods are often contrasted with deep learning, but recent advances in machine learning have identified and developed exciting correspondences between the two (Lee et al., 2018; Matthews et al., 2018; Jacot et al., 2018). While kernel regression is a useful method in its own right, it has also been used to better understand neural networks and deep learning. More specifically, if the parameters of a neural network are treated as random variables whose distribution is set at initialization, the network itself can be viewed as a random function. Then, as the width of the network becomes large, the distribution of this random function converges to a Gaussian process with a specific covariance function, or kernel. We refer to kernels that arise from this connection with infinite-width neural networks as neural kernels. The specific kernel is determined by the architecture, inference type, and other hyperparameters of the neural network.

Moreover, the connection between neural networks and Gaussian processes has generated many high-performance kernels for diverse or nonstandard data modalities, such as images, sequences, and graphs. This performance often comes at a cost, as the kernels require significantly more compute than standard kernels such as the RBF kernel. For example, computing the entire kernel for the CIFAR-10 dataset takes less than one GPU minute for an RBF kernel but around 300 GPU hours for the Myrtle kernel (Shankar et al., 2020; Lee et al., 2020). However, this increase in compute significantly decreases the test error rate, from around 40% to 10%. The added demand of simply computing entries of the kernel comes on top of the challenges posed by kernel regression itself, whose inference cost scales cubically in time and quadratically in memory with dataset size. Approximate inference methods frequently reduce memory requirements by recomputing kernel entries on the fly, which is infeasible for these expensive kernels.
Such challenges have limited our understanding of infinite-width models to small datasets. In particular, while scaling laws have been studied across many orders of magnitude for neural networks, the same is not true of their corresponding kernels. Similarly, while it is common to augment training datasets in neural networks, significantly expanding their size, the benefits for kernel methods have not been as thoroughly explored.

1.1. CONTRIBUTIONS

In this work, we address these two main computational challenges (computing large, complex kernels and using them for kernel regression) by parallelizing and scaling up existing algorithms to
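The second challenge, solving the regression system itself, can be illustrated in miniature. The sketch below is our own, not the paper's distributed implementation: preconditioned conjugate gradients solves (K + lam*I) alpha = y using only matrix-vector products with K, and a Nyström preconditioner built from a subset of landmark rows (one common choice for kernel systems) accelerates convergence via the Woodbury identity. A cheap RBF kernel stands in for the expensive neural kernels:

```python
import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve
from scipy.sparse.linalg import LinearOperator, cg

def rbf(X1, X2, gamma=0.5):
    # Cheap stand-in kernel; neural kernels are far costlier to evaluate,
    # but the solver logic is identical.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, m, lam = 1000, 200, 1e-2
X = rng.normal(size=(n, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

K = rbf(X, X)
A = LinearOperator((n, n), matvec=lambda v: K @ v + lam * v)

# Nystrom preconditioner from m landmark rows: K ~= U U^T with
# U = K_nm L^{-T}, where K_mm = L L^T; then (U U^T + lam I)^{-1}
# is applied cheaply via the Woodbury identity.
idx = rng.choice(n, m, replace=False)
L = cholesky(rbf(X[idx], X[idx]) + 1e-8 * np.eye(m), lower=True)
U = np.linalg.solve(L, rbf(X[idx], X)).T           # shape (n, m)
core = cho_factor(lam * np.eye(m) + U.T @ U)
M = LinearOperator((n, n),
                   matvec=lambda v: (v - U @ cho_solve(core, U.T @ v)) / lam)

alpha, info = cg(A, y, M=M, maxiter=1000)
assert info == 0  # CG reports convergence
```

At the scales treated in this paper, K no longer fits on one machine, so both the matrix-vector products and the preconditioner must be distributed across accelerators; the toy above only captures the numerical structure of the solve.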

