DATASET META-LEARNING FROM KERNEL RIDGE-REGRESSION

Abstract

One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of ε-approximation of datasets, obtaining datasets that are much smaller than, or are significant corruptions of, the original training data while maintaining similar model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset selection methods while obtaining state of the art results for MNIST and CIFAR-10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state of the art results for neural network dataset distillation, with potential applications to privacy preservation.

1. INTRODUCTION

Datasets are a pivotal component in any machine learning task. Typically, a machine learning problem regards a dataset as given and uses it to train a model according to some specific objective. In this work, we depart from this traditional paradigm by instead optimizing a dataset with respect to a learning objective; the resulting dataset can then be used in a range of downstream learning tasks. Our work is directly motivated by several challenges in existing learning methods. Kernel methods, and instance-based learning more generally (Vinyals et al., 2016; Snell et al., 2017; Kaya & Bilge, 2019), require a support dataset to be deployed at inference time. Achieving good prediction accuracy typically requires a large support set, which inevitably increases both memory footprint and latency at inference time (the scalability issue). Deploying a support set of original examples can also raise privacy concerns, e.g., when distributing raw images to user devices. Additional challenges to scalability include, for instance, the desire for rapid hyperparameter search (Shleifer & Prokop, 2019) and minimizing the resources consumed when replaying data for continual learning (Borsos et al., 2020). A valuable contribution to all these problems would be to find surrogate datasets that mitigate these challenges, which occur for naturally occurring datasets, without a significant sacrifice in performance.

This suggests the following

Question: What is the space of datasets, possibly with constraints on size or signal preserved, whose trained models are all (approximately) equivalent to some specific model?

In attempting to answer this question, in the setting of supervised learning on image data, we discover a rich variety of datasets, diverse in size and human interpretability while also robust to model architectures, which yield high performance or state of the art (SOTA) results when used as training data. We obtain such datasets through the introduction of a novel meta-learning algorithm called Kernel Inducing Points (KIP). Figure 1 shows some example images from our learned datasets. We explore KIP in the context of compressing and corrupting datasets, validating its effectiveness in the setting of kernel ridge-regression (KRR) and neural network training on the benchmark datasets MNIST and CIFAR-10. Our contributions can be summarized as follows:

1.1 SUMMARY OF CONTRIBUTIONS

• We formulate a novel concept of ε-approximation of a dataset. This provides a theoretical framework for understanding dataset distillation and compression.

• We introduce Kernel Inducing Points (KIP), a meta-learning algorithm for obtaining ε-approximations of datasets. We establish convergence in the case of a linear kernel in Theorem 1. We also introduce a variant called Label Solve (LS), which gives a closed-form solution for obtaining distilled datasets differing only via labels.

• We explore the following aspects of ε-approximation of datasets:

1. Compression (Distillation) for Kernel Ridge-Regression: For kernel ridge-regression, we improve sample efficiency by one to two orders of magnitude, e.g. using 10 images to outperform hundreds or thousands of images (Tables 1, 2 vs Tables A1, A2). We obtain state of the art results for MNIST and CIFAR-10 classification while using few enough images (10K) to allow for in-memory inference (Tables A3, A4).

2. Compression (Distillation) for Neural Networks: We obtain state of the art dataset distillation results for the training of neural networks, oftentimes even with only a single-hidden-layer fully-connected network (Tables 1 and 2).

3. Privacy: We obtain datasets with a strong trade-off between corruption and test accuracy, which suggests applications to privacy-preserving dataset creation. In particular, we produce images with up to 90% of their pixels corrupted with limited degradation in performance, as measured by test accuracy in the appropriate regimes (Figures 3, A3, and Tables A5-A10), and which simultaneously outperform natural images in a wide variety of settings.

• We provide an open source implementation of KIP and LS, available in an interactive Colab notebook.¹
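The Label Solve (LS) variant mentioned above admits a simple closed form: holding the support inputs fixed, the support labels minimizing the KRR loss on the target set are the least-squares solution of a linear system. The following is a minimal numpy sketch under assumed conventions (the function name `label_solve` and the precomputed-kernel interface are illustrative, not the paper's implementation):

```python
import numpy as np

def label_solve(K_ss, K_ts, y_t, lam=1e-6):
    """Closed-form Label Solve (LS) sketch.

    With support inputs fixed, KRR predictions on the target set are
    A @ y_s, where A = K_ts @ inv(K_ss + lam * I). The support labels
    minimizing || A @ y_s - y_t ||^2 are the least-squares solution
    y_s = pinv(A) @ y_t.

    K_ss: (n_s, n_s) kernel matrix on the support set.
    K_ts: (n_t, n_s) kernel matrix between target and support sets.
    y_t:  (n_t, k) target labels (e.g. one-hot classes).
    """
    n_s = K_ss.shape[0]
    A = K_ts @ np.linalg.inv(K_ss + lam * np.eye(n_s))
    y_s, *_ = np.linalg.lstsq(A, y_t, rcond=None)
    return y_s
```

Note that the learned labels need not be one-hot, which is consistent with the dense label covariance structure shown in Figure 1(b).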

2. SETUP

In this section we define some key concepts for our methods.
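As a concrete reference point for the KRR predictor that appears throughout, here is a minimal numpy sketch: fit on a support set (X_s, y_s) and predict on target points X_t via the standard ridge-regularized kernel solve. The function name and the default RBF kernel are illustrative assumptions, not the paper's kernels (which include neural kernels):

```python
import numpy as np

def krr_fit_predict(X_s, y_s, X_t, lam=1e-6, kernel=None):
    """Kernel ridge-regression sketch.

    Predicts f(X_t) = K(X_t, X_s) @ (K(X_s, X_s) + lam * I)^{-1} @ y_s.
    `kernel` defaults to an RBF kernel for illustration.
    """
    if kernel is None:
        def kernel(A, B, gamma=1.0):
            # RBF kernel: exp(-gamma * ||a - b||^2) for all pairs.
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * sq)
    K_ss = kernel(X_s, X_s)
    K_ts = kernel(X_t, X_s)
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(X_s)), y_s)
    return K_ts @ alpha
```

In the dataset-distillation setting, (X_s, y_s) is the small learned support set and the memory and latency costs of inference scale with its size, which is why compressing it matters.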



¹ https://colab.research.google.com/github/google-research/google-research/blob/master/kip/KIP.ipynb



Figure 1: (a) Learned samples of CIFAR-10 using KIP and its variant KIP ρ, for which a fraction ρ of the pixels are uniform noise. Using 1000 such images to train a 1-hidden-layer fully-connected network results in 49.2% and 45.0% CIFAR-10 test accuracy, respectively, whereas using 1000 original CIFAR-10 images results in 35.4% test accuracy. (b) Example of labels obtained by label solving (LS) (left two) and the covariance matrix between original labels and learned labels (right). Here, 500 labels were distilled from the CIFAR-10 train dataset using the Myrtle 10-layer convolutional network. A test accuracy of 69.7% is achieved using these labels for kernel ridge-regression.

