A comparison of dataset distillation and active learning in text classification

Abstract

Deep learning has achieved great success over the past few years in areas ranging from computer vision to natural language processing. However, the huge amount of data required by deep learning has always been a thorny problem for learning the underlying distribution and tackling various human tasks. To alleviate this problem, knowledge distillation was proposed to simplify the model, and dataset distillation was later proposed as a new method for reducing dataset sizes: it aims to synthesize a small number of samples that contain all the information of a very large dataset. Meanwhile, active learning is another effective method for reducing dataset sizes, as it selects only the most informative samples from the original dataset for labeling. In this paper, we explore the differences between the principles of dataset distillation and active learning, and evaluate the two algorithms on an NLP classification dataset, the Stanford Sentiment Treebank. In our experiments, distilled data amounting to 0.1% of the original text data achieves approximately 88% accuracy, while the selected data achieves 52% of the performance of the original data.

1. Introduction

Deep learning, which employs neural networks with many layers to solve complicated human tasks, has achieved great success over the past few years in a wide variety of applications ranging from computer vision to natural language processing. These networks must be trained on large amounts of data in order to learn the underlying distribution of the task. However, the huge computational complexity and the massive storage that deep learning requires are two main obstacles. Therefore, data reduction has become a popular and feasible way to learn a model and solve a task from a small amount of data. To improve the efficiency of deep neural networks, knowledge distillation (Hinton et al., 2015) was proposed to compress a complex model into a simpler one while retaining approximately the same accuracy as the original model. In knowledge distillation, a teacher model transfers its knowledge to a student model that has fewer layers and requires less storage. The student model is trained to achieve two goals: minimizing the loss between the teacher model's soft predictions (softmax at distillation temperature T = t) and the student model's soft predictions, which requires the distilled model to match the performance of the original model as closely as possible; and minimizing the loss between the student's hard predictions (softmax at T = 1) and the ground truth. Knowledge distillation thus acts as a form of model compression in deep learning that can reduce over-fitting, and it is widely used in mining very large unsupervised datasets, few-shot and zero-shot learning, transfer learning, and so on. Different from knowledge distillation, which transfers a complex model to a simpler model, dataset distillation (Wang et al., 2018) aims to encapsulate all the knowledge of a large dataset into a small number of samples.
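The two distillation objectives described above can be sketched in a few lines. The following is an illustrative implementation, not the exact formulation of any particular system; the weighting factor alpha and the temperature T = 4 are illustrative choices, and the T² rescaling of the soft term follows the suggestion of Hinton et al. (2015).

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T yields softer distributions."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of the two objectives:
    (1) cross-entropy between teacher and student soft predictions at temperature T,
    (2) cross-entropy between the student's hard prediction (T = 1) and the ground truth.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()
    p_hard = softmax(student_logits, 1.0)
    hard_loss = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    # T**2 rescales the soft-target term so its gradient magnitude is
    # comparable to the hard-target term (Hinton et al., 2015).
    return alpha * (T ** 2) * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits match the teacher's incurs only the irreducible entropy of the softened teacher distribution, while a student that disagrees with the teacher is penalized by both terms.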
It is an alternative formulation that reduces dataset size by synthesizing a small number of data points containing all the information of the original dataset.
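As a concrete illustration of this idea, the following toy sketch distills a linear-regression dataset: a handful of synthetic points are optimized so that a model trained only on them performs well on the real data. This is a deliberately simplified stand-in for the method of Wang et al. (2018): for brevity it learns only the synthetic labels (the full method also learns the synthetic inputs and backpropagates through the training procedure rather than using finite differences), and all sizes, step counts, and learning rates here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset: a noisy linear-regression task (stand-in for a large dataset).
X_real = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_real = X_real @ w_true + 0.1 * rng.normal(size=200)

def inner_train(X_syn, y_syn, steps=10, lr=0.1):
    """Train a linear model from scratch on the synthetic data only."""
    w = np.zeros(X_syn.shape[1])
    for _ in range(steps):
        w -= lr * X_syn.T @ (X_syn @ w - y_syn) / len(y_syn)
    return w

def outer_loss(X_syn, y_syn):
    """Real-data loss of the model obtained by training on the synthetic data."""
    w = inner_train(X_syn, y_syn)
    return np.mean((X_real @ w - y_real) ** 2)

# Distillation: 3 synthetic points stand in for 200 real ones. Update the
# synthetic labels by finite-difference gradient descent on the outer loss.
X_syn = rng.normal(size=(3, 5))
y_syn = np.zeros(3)
eps, lr_outer = 1e-4, 0.01
loss_before = outer_loss(X_syn, y_syn)
for _ in range(200):
    base = outer_loss(X_syn, y_syn)
    grad = np.zeros_like(y_syn)
    for i in range(len(y_syn)):
        y_pert = y_syn.copy()
        y_pert[i] += eps
        grad[i] = (outer_loss(X_syn, y_pert) - base) / eps
    y_syn -= lr_outer * grad
loss_after = outer_loss(X_syn, y_syn)
```

The nested structure — an inner loop that trains on the synthetic data and an outer loop that optimizes the synthetic data against real-data performance — is the defining feature of dataset distillation.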

