DATASET CONDENSATION WITH GRADIENT MATCHING

Abstract

As the state-of-the-art machine learning methods in many fields rely on ever larger datasets, storing datasets and training models on them become significantly more expensive. This paper proposes a training set synthesis technique for data-efficient learning, called Dataset Condensation, that learns to condense a large dataset into a small set of informative synthetic samples for training deep neural networks from scratch. We formulate this goal as a gradient matching problem between the gradients of deep neural network weights that are trained on the original and our synthetic data. We rigorously evaluate its performance in several computer vision benchmarks and demonstrate that it significantly outperforms the state-of-the-art methods.¹ Finally, we explore the use of our method in continual learning and neural architecture search, and report promising gains when memory and computation budgets are limited.

1. INTRODUCTION

Large-scale datasets, comprising millions of samples, are becoming the norm for obtaining state-of-the-art machine learning models in multiple fields including computer vision, natural language processing and speech recognition. At such scales, even storing and preprocessing the data becomes burdensome, and training machine learning models on them demands specialized equipment and infrastructure. An effective way to deal with large data is data selection, i.e. identifying the most representative training samples, which aims at improving the data efficiency of machine learning techniques. While classical data selection methods, also known as coreset construction (Agarwal et al., 2004; Har-Peled & Mazumdar, 2004; Feldman et al., 2013), focus on clustering problems, recent work can be found in continual learning (Rebuffi et al., 2017; Toneva et al., 2019; Castro et al., 2018; Aljundi et al., 2019) and active learning (Sener & Savarese, 2018), where there is typically a fixed budget for storing and labeling training samples respectively. These methods commonly first define a criterion for representativeness (e.g. in terms of compactness (Rebuffi et al., 2017; Castro et al., 2018), diversity (Sener & Savarese, 2018; Aljundi et al., 2019) or forgetfulness (Toneva et al., 2019)), then select representative samples based on this criterion, and finally use the selected small set to train a model for a downstream task. Unfortunately, these methods have two shortcomings: they typically rely on i) heuristics (e.g. picking cluster centers) that do not guarantee an optimal solution for the downstream task (e.g. image classification), and ii) the presence of representative samples, which is not guaranteed either. A recent method, Dataset Distillation (DD) (Wang et al., 2018), goes beyond these limitations by learning a small set of informative images from large training data.
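To make the selection criteria above concrete, the following is a minimal, self-contained sketch of one diversity-based heuristic, greedy k-center selection (in the spirit of the criterion used by Sener & Savarese, 2018). It is not the method proposed in this paper; the function name, the Euclidean feature space, and the seeding scheme are assumptions made for the example.

```python
import numpy as np

def kcenter_greedy(X, budget, seed=0):
    """Select `budget` indices by the greedy k-center rule: start from an
    arbitrary point, then repeatedly add the point farthest from the
    current selection (a diversity criterion)."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # distance from every point to its nearest selected point so far
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(dist))          # farthest point from the selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

With two well-separated clusters and a budget of two, the second pick always lands in the cluster the first pick missed, so the heuristic spreads the budget over modes of the data. As noted above, however, nothing in such a rule ties the selection to downstream accuracy, which is exactly the limitation that motivates learning synthetic samples instead.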
In particular, the authors model the network parameters as a function of the synthetic training data and learn the synthetic data by minimizing the training loss over the original training data w.r.t. the synthetic data. Unlike the coreset methods, the synthesized data are directly optimized for the downstream task, and thus the success of the method does not rely on the presence of representative samples. Inspired by DD (Wang et al., 2018), we focus on learning to synthesize informative samples that are optimized to train neural networks for downstream tasks and are not limited to individual samples in the original dataset. Like DD, our goal is to obtain the highest generalization performance with a model trained on a small set of synthetic images, ideally comparable to that of a model trained on the original images (see Figure 1(a)). In particular, we investigate the following questions. Is it possible to i) compress a large image classification dataset into a small synthetic set, ii) train an image classification model on the synthetic set that can be further used to classify real images, and iii) learn a single set of synthetic images that can be used to train different neural network architectures? To this end, we propose a Dataset Condensation method to learn a small set of "condensed" synthetic samples such that a deep neural network trained on them obtains not only similar performance but also a close solution in the network parameter space to a network trained on the large training set. We formulate this goal as a minimization problem between two sets of gradients of the network parameters, computed for a training loss over the large fixed training set and over a learnable condensed set (see Figure 1(b)). We show that our method enables effective learning of synthetic images and of neural networks trained on them, and that it outperforms DD (Wang et al., 2018) and coreset methods by a wide margin in multiple computer vision benchmarks.
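The gradient matching idea can be illustrated with a deliberately simplified sketch. It replaces the deep network with linear regression so the weight gradients can be written by hand, matches gradients at a single fixed weight vector instead of along training trajectories from many random initializations, and uses finite differences in place of an autograd framework; all sizes, the squared-distance form of the matching loss, and the hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the large dataset T: a linear-regression problem,
# so the gradient w.r.t. the model weights is cheap to write by hand.
n, d, m = 200, 5, 4                      # real samples, input dim, synthetic samples
w_true = rng.normal(size=d)
X_real = rng.normal(size=(n, d))
y_real = X_real @ w_true + 0.01 * rng.normal(size=n)

def grad_w(X, y, w):
    """Gradient of the mean-squared-error training loss w.r.t. weights w."""
    return (2.0 / len(y)) * X.T @ (X @ w - y)

def match_loss(X_syn, y_syn, w):
    """Squared distance between real-data and synthetic-data gradients."""
    return np.sum((grad_w(X_real, y_real, w) - grad_w(X_syn, y_syn, w)) ** 2)

# Learn the synthetic inputs by gradient descent on the matching loss.
w0 = rng.normal(size=d)                  # single fixed weight vector (simplification)
X_syn = rng.normal(size=(m, d))
y_syn = rng.normal(size=m)               # synthetic labels kept fixed here
eps, lr = 1e-5, 1e-3
losses = []
for step in range(1500):
    base = match_loss(X_syn, y_syn, w0)
    losses.append(base)
    g = np.zeros_like(X_syn)             # finite-difference outer gradient
    for i in range(m):
        for j in range(d):
            X_pert = X_syn.copy()
            X_pert[i, j] += eps
            g[i, j] = (match_loss(X_pert, y_syn, w0) - base) / eps
    X_syn -= lr * g
```

After the loop, the synthetic set induces nearly the same weight gradient as the full real dataset at w0, which is the core of the idea: a network taking descent steps on the small set moves through parameter space much as it would on the large set.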
In addition, learning a compact set of synthetic samples also benefits other learning problems when there is a fixed budget on training images. We show that our method outperforms popular data selection methods by providing more informative training samples in continual learning. Finally, we explore a promising use case of our method in neural architecture search and show that, once our condensed images are learned, they can be used to train numerous network architectures extremely efficiently.

Our method is related to knowledge distillation (KD) techniques (Hinton et al., 2015; Buciluǎ et al., 2006; Ba & Caruana, 2014; Romero et al., 2014) that transfer the knowledge in an ensemble of models to a single model. Unlike KD, we distill the knowledge of a large training set into a small synthetic set. Our method is also related to Generative Adversarial Networks (Goodfellow et al., 2014a; Mirza & Osindero, 2014; Radford et al., 2015) and Variational AutoEncoders (Kingma & Welling, 2013) that synthesize high-fidelity samples by capturing the data distribution. In contrast, our goal is to generate informative samples for training deep neural networks rather than to produce "real-looking" samples. Finally, our method is related to methods that produce image patches by projecting feature activations back to the input pixel space (Zeiler & Fergus, 2014), reconstruct the input image by matching feature activations (Mahendran & Vedaldi, 2015), recover private training images from given training gradients (Zhu et al., 2019; Zhao et al., 2020), and synthesize features from semantic embeddings for zero-shot learning (Sariyildiz & Cinbis, 2019). Our goal, however, is to synthesize a set of condensed training images, not to recover the original or missing training images.

In the remainder of this paper, we first review the problem of dataset condensation and introduce our method in section 2, present and analyze our results in several image recognition benchmarks in section 3.1, showcase applications in continual learning and network architecture search in section 3.2, and conclude the paper with remarks for future directions in section 4.

¹ The implementation is available at https://github.com/VICO-UoE/DatasetCondensation.

Figure 1: Dataset Condensation (left) aims to generate a small set of synthetic images that can match the performance of a network trained on a large image dataset. Our method (right) realizes this goal by learning a synthetic set such that a deep network trained on it and one trained on the large set produce similar gradients w.r.t. their weights. The synthetic data can later be used to train a network from scratch at a small fraction of the original computational load. CE denotes Cross-Entropy.

2. DATASET CONDENSATION

Suppose we are given a large dataset consisting of |T| pairs of a training image and its class label, T = {(x_i, y_i)}_{i=1}^{|T|}, where x ∈ X ⊂ R^d, y ∈ {0, ..., C-1}, X is a d-dimensional input space and C is the number of classes. We wish to learn a differentiable function φ (i.e. a deep neural network)

