A comparison of dataset distillation and active learning in text classification

Abstract

Deep learning has achieved great success over the past few years in areas ranging from computer vision to natural language processing. However, the huge amount of data that deep learning requires has always been a thorny problem when learning the underlying distribution of a task. To alleviate this problem, knowledge distillation was proposed to simplify the model, and dataset distillation was later proposed as a new method of reducing dataset size, which aims to synthesize a small number of samples that contain all the information of a very large dataset. Meanwhile, active learning is another effective method of reducing dataset size: it selects only the most informative samples of the original dataset for labeling. In this paper, we explore the discrepancies between the principles of dataset distillation and active learning, and evaluate the two algorithms on an NLP classification dataset, the Stanford Sentiment Treebank. In our experiment, the distilled data, at 0.1% of the size of the original text data, achieves approximately 88% accuracy, while the selected data achieves 52% of the performance of the original data.

1. Introduction

Deep learning, in which neural networks with many layers are trained to solve complicated human tasks, has achieved great success over the past few years in applications ranging from computer vision to natural language processing. These networks have to be trained on large amounts of data in order to learn the underlying distribution of the task. However, the huge computational complexity and massive storage that deep learning requires are two main obstacles in practice. Data reduction has therefore become a popular and feasible way to learn a model and solve a task from a small amount of data. To improve the efficiency of deep neural networks, knowledge distillation (Hinton et al., 2015) was proposed to compress a complex model into a simpler model while retaining approximately the same accuracy as the original. In knowledge distillation, a teacher model transfers its knowledge to a student model that has fewer layers and requires less storage. The student model is trained towards two goals: minimizing the loss between the teacher model's soft predictions (softmax at distillation temperature T = t) and the student model's soft predictions, which requires the distilled model to match the performance of the original model as closely as possible; and minimizing the loss between the student's hard predictions (softmax at T = 1) and the ground truth. Knowledge distillation thus acts as a form of model compression that can reduce over-fitting, and it is widely used in mining very large unsupervised datasets, few- and zero-shot learning, transfer learning, and so on. Different from knowledge distillation, which transfers a complex model into a simpler one, dataset distillation (Wang et al., 2018) aims to encapsulate all the knowledge of a large dataset into a small number of data points.
It is an alternative formulation for reducing dataset size: it synthesizes a small number of data points that contain all the information of a large dataset and that do not need to come from the true data distribution. Compared with other methods that aim to reduce dataset size, such as prototype selection (Garcia et al., 2012) in nearest-neighbor classification, dataset distillation generates a small set of data from the original data rather than selecting samples from the true distribution. It relies on backpropagated gradients to compress a dataset into a small set of synthetic samples, so that a complex neural network can be trained on a small amount of data rather than the original massive dataset while reaching almost the same accuracy. Deep learning can be likened to a classroom in which students learn everything from the teacher in full detail even though the amount of knowledge is enormous; active learning, by contrast, learns only deliberately selected parts of the knowledge so as to master it to the maximum extent. It is a cyclic procedure between a teacher or an oracle (usually a human annotator) and an active learner (Schröder & Niekler, 2020). Compared with passive learning, which simply feeds the data to the model, active learning chooses which samples are to be labeled by a human expert, the so-called human in the loop. In this paper, we discuss the discrepancies between these two methods of reducing dataset size and apply both algorithms to text classification on an NLP classification dataset: the Stanford Sentiment Treebank (SST), which consists of 215,154 phrases from movie reviews with fine-grained sentiment labels in the range of 0 to 1. We use classification accuracy to compare the performance of active learning and dataset distillation on text classification.
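The gradient-based idea behind dataset distillation can be illustrated with a toy sketch. The sketch below is deliberately simplified: it uses a linear regression model, a fixed random synthetic input matrix, and optimizes only the synthetic labels so that the meta-gradient through one inner training step stays analytic. All variable names and hyperparameter values are illustrative assumptions, not the actual formulation of Wang et al. (2018), which unrolls multiple steps and optimizes the synthetic inputs as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" regression dataset: 200 points, 5 features.
n, d, m = 200, 5, 2                      # m = number of synthetic points
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

# Synthetic set: fixed random inputs, learnable labels (a simplified,
# labels-only variant of the full method).
Xs = rng.normal(size=(m, d))
ys = np.zeros(m)

eta_inner, eta_outer = 0.1, 0.5

def inner_step(ys, w0):
    """One gradient step of linear regression trained on the synthetic set."""
    grad_w = 2.0 / m * Xs.T @ (Xs @ w0 - ys)
    return w0 - eta_inner * grad_w

for _ in range(500):
    w0 = np.zeros(d)                     # fixed initialization per outer step
    w1 = inner_step(ys, w0)
    # Outer objective: real-data MSE of the one-step-trained model.
    grad_w1 = 2.0 / n * X.T @ (X @ w1 - y)
    # Backpropagate through the inner step: w1 is linear in ys with
    # d w1 / d ys = (2 * eta_inner / m) * Xs.T, so the meta-gradient is:
    grad_ys = (2.0 * eta_inner / m) * (Xs @ grad_w1)
    ys -= eta_outer * grad_ys

# A model trained for one step on the two synthetic points now fits the
# 200 real points noticeably better than an untrained model.
w_final = inner_step(ys, np.zeros(d))
mse = float(np.mean((X @ w_final - y) ** 2))
```

The same bilevel structure (optimize synthetic data so that a model trained on it performs well on real data) carries over to neural networks, where the meta-gradient is obtained by automatic differentiation instead of the closed-form chain rule used here.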
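The selection step of pool-based active learning can likewise be sketched in a few lines. The snippet below uses least-confidence sampling, one common uncertainty criterion; the function name and the example probabilities are illustrative assumptions, not a specific algorithm from the cited works.

```python
import numpy as np

def least_confidence_query(probs, k):
    """Return indices of the k pool samples the model is least confident about.

    probs: array of shape (num_samples, num_classes) with predicted
    class probabilities for the unlabeled pool.
    """
    confidence = probs.max(axis=1)        # top-class probability per sample
    return np.argsort(confidence)[:k]     # lowest confidence first

# Example pool: three samples, two classes.
pool_probs = np.array([[0.95, 0.05],
                       [0.55, 0.45],
                       [0.70, 0.30]])
query = least_confidence_query(pool_probs, k=2)   # samples sent to the oracle
```

In a full active-learning loop, the oracle labels the queried samples, they are moved from the pool into the labeled training set, and the model is retrained; the cycle repeats until the labeling budget is exhausted.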

2.1. Knowledge Distillation

Hinton et al. proposed knowledge distillation as a method for imbuing smaller, more efficient networks with the knowledge of their larger counterparts (Hinton et al., 2015). Knowledge distillation has since been applied in various settings. Linfeng Zhang et al. proposed self-distillation, which notably enhances the accuracy of convolutional neural networks by shrinking the size of the network rather than enlarging it (Zhang et al., 2019). To mitigate the massive number of parameters in deep neural networks, Sukmin Yun et al. proposed a new regularization method that penalizes the predictive distribution between similar samples and distills the predictive distribution between different samples of the same label during training (Yun et al., 2020). For causal distillation in language models, Zhengxuan Wu et al. proposed augmenting distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (Wu et al., 2021).
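The two-term objective described in the introduction (a soft term between temperature-softened teacher and student predictions, plus a hard term against the ground truth) can be written as a minimal NumPy sketch. The interpolation weight `alpha` and the temperature `T = 4.0` below are illustrative defaults, not values prescribed by the papers above.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD loss: alpha * soft (KL) term + (1 - alpha) * hard (CE) term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Soft term: KL(teacher || student) on T-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = np.mean(
        np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    ) * T * T
    # Hard term: standard cross-entropy against ground-truth labels (T = 1).
    log_q = np.log(softmax(student_logits, 1.0))
    hard = -np.mean(log_q[np.arange(len(labels)), labels])
    return alpha * soft + (1 - alpha) * hard
```

When the student's logits match the teacher's, the soft term vanishes and only the hard cross-entropy remains, which is the sense in which the student is pulled toward both the teacher's behavior and the true labels.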

2.2. Data Distillation

Data distillation was first proposed by Tongzhou Wang et al. as an alternative formulation that synthesizes a small number of data points from a massive original dataset (Wang et al., 2018). It has since been applied in various domains. Tian Dong et al. have shown that dataset distillation is a promising replacement for traditional data generators in private data generation, because the distilled data are synthetic and therefore do not disclose the original information (Dong et al., 2022). Shengyuan Hu et al. have proposed a new scheme that uses gradient compression via dataset distillation for upstream communication in federated learning, where each client transmits a light-weight distilled dataset as the training data instead of transmitting the model update (Hu et al., 2022). To reduce catastrophic forgetting in deep neural networks, Wojciech Masarczyk et al. have investigated generating synthetic data through a two-step optimization process via meta-gradients (Masarczyk & Tautkute, 2020). To alleviate the storage and training time of neural models on large-scale graphs in real-world applications, Wei Jin et al. have formulated a graph condensation problem that imitates the GNN training trajectory on the original graph through the optimization of a gradient-matching loss, and have designed a strategy to condense node features and structural information simultaneously (Jin et al., 2021). Yongqi Li et al. have developed a novel data

