DATASET CONDENSATION WITH GRADIENT MATCHING

Abstract

As state-of-the-art machine learning methods in many fields come to rely on ever larger datasets, storing datasets and training models on them becomes significantly more expensive. This paper proposes a training set synthesis technique for data-efficient learning, called Dataset Condensation, that learns to condense a large dataset into a small set of informative synthetic samples for training deep neural networks from scratch. We formulate this goal as a gradient matching problem between the gradients of deep neural network weights that are trained on the original and our synthetic data. We rigorously evaluate its performance on several computer vision benchmarks and demonstrate that it significantly outperforms the state-of-the-art methods. Finally, we explore the use of our method in continual learning and neural architecture search, and report promising gains when limited memory and computation are available.

1. INTRODUCTION

Large-scale datasets, comprising millions of samples, are becoming the norm for obtaining state-of-the-art machine learning models in multiple fields including computer vision, natural language processing and speech recognition. At such scales, even storing and preprocessing the data becomes burdensome, and training machine learning models on them demands specialized equipment and infrastructure. An effective way to deal with large data is data selection, i.e. identifying the most representative training samples, which aims at improving the data efficiency of machine learning techniques. While classical data selection methods, also known as coreset construction (Agarwal et al., 2004; Har-Peled & Mazumdar, 2004; Feldman et al., 2013), focus on clustering problems, recent work can be found in continual learning (Rebuffi et al., 2017; Toneva et al., 2019; Castro et al., 2018; Aljundi et al., 2019) and active learning (Sener & Savarese, 2018), where there is typically a fixed budget for storing and labeling training samples, respectively. These methods commonly first define a criterion for representativeness (e.g. in terms of compactness (Rebuffi et al., 2017; Castro et al., 2018), diversity (Sener & Savarese, 2018; Aljundi et al., 2019) or forgetfulness (Toneva et al., 2019)), then select the representative samples based on that criterion, and finally use the selected small set to train their model for a downstream task. Unfortunately, these methods have two shortcomings: they typically rely on i) heuristics (e.g. picking cluster centers) that do not guarantee any optimal solution for the downstream task (e.g. image classification), and ii) the presence of representative samples, which is not guaranteed either. A recent method, Dataset Distillation (DD) (Wang et al., 2018), goes beyond these limitations by learning a small set of informative images from large training data.
In particular, the authors model the network parameters as a function of the synthetic training data and learn the latter by minimizing the training loss over the original training data w.r.t. the synthetic data. Unlike in the coreset methods, the synthesized data are directly optimized for the downstream task, and thus the success of the method does not rely on the presence of representative samples. Inspired by DD (Wang et al., 2018), we focus on learning to synthesize informative samples that are optimized to train neural networks for downstream tasks and are not limited to individual samples in the original dataset. Like DD, our goal is to obtain the highest generalization performance with a model trained on a small set of synthetic images, ideally comparable to that of a model trained on the original images (see Figure 1(a)). In particular, we investigate the following
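To make the gradient matching idea concrete, the following is a toy sketch of our own, not the paper's implementation (which matches the gradients of deep network weights on image data): for a linear regression model, a handful of learnable synthetic samples and targets are optimized so that the loss gradient they induce at a given weight vector matches the gradient computed on the full real dataset. All sizes, learning rates and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" dataset: 200 samples from a noisy linear model (stand-in for the
# large original training set).
X_real = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_real = X_real @ w_true + 0.01 * rng.normal(size=200)

# Tiny synthetic set: 2 learnable samples with learnable targets.
X_syn = rng.normal(size=(2, 5))
y_syn = rng.normal(size=2)

w = rng.normal(size=5)  # weight vector at which the two gradients are matched

def grad_mse(X, y, w):
    """Gradient of 0.5 * mean squared error w.r.t. the weights w."""
    return X.T @ (X @ w - y) / len(X)

g_real = grad_mse(X_real, y_real, w)
initial_mismatch = np.linalg.norm(grad_mse(X_syn, y_syn, w) - g_real)

# Gradient descent on the matching loss 0.5 * ||g_syn - g_real||^2 w.r.t.
# the synthetic data; for this linear model the gradients are analytic.
lr = 0.01
for _ in range(2000):
    n = len(X_syn)
    r = X_syn @ w - y_syn               # per-sample residuals
    diff = X_syn.T @ r / n - g_real     # gradient-matching residual
    # d(loss)/dX_syn and d(loss)/dy_syn, derived from g_syn = X^T (Xw - y)/n.
    grad_X = (np.outer(r, diff) + np.outer(X_syn @ diff, w)) / n
    grad_y = -(X_syn @ diff) / n
    X_syn -= lr * grad_X
    y_syn -= lr * grad_y

final_mismatch = np.linalg.norm(grad_mse(X_syn, y_syn, w) - g_real)
print(initial_mismatch, final_mismatch)
```

In the actual method the model is a deep network, the matching is performed at many weight configurations along a training trajectory rather than a single fixed w, and the matching-loss gradients are obtained by automatic differentiation; the linear model above merely keeps the gradients closed-form.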



The implementation is available at https://github.com/VICO-UoE/DatasetCondensation.

