DATA VALUATION WITHOUT TRAINING OF A MODEL

Abstract

Many recent works on understanding deep learning attempt to quantify how much individual data instances influence the optimization and generalization of a model. Such attempts reveal the characteristics and importance of individual instances, which may provide useful information for diagnosing and improving deep learning. However, most existing works on data valuation require actual training of a model, which often demands high computational cost. In this paper, we provide a training-free data valuation score, called the complexity-gap score, which is a data-centric score that quantifies the influence of individual instances on the generalization of two-layer overparameterized neural networks. The proposed score quantifies the irregularity of instances and measures how much each data instance contributes to the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics.

1. INTRODUCTION

The creation of large datasets has driven the development of deep learning in diverse applications, including computer vision (Krizhevsky et al., 2012; Dosovitskiy et al., 2021), natural language processing (Vaswani et al., 2017; Brown et al., 2020), and reinforcement learning (Mnih et al., 2015; Silver et al., 2016). To utilize datasets in a more efficient and effective manner, some recent works have attempted to understand the role of individual data instances in the training and generalization of neural networks. In (Ghorbani & Zou, 2019), a metric quantifying the contribution of each training instance to achieving high test accuracy was analyzed under the assumption that not only the training data but also the test data is available. Jiang et al. (2021) defined a score to identify irregular examples that need to be memorized during training in order for the model to classify them accurately. All these previous methods for data valuation require actual training of a model to quantify the role of individual instances in the model. Thus, the valuation itself often incurs high computational cost, which may contradict some of the motivations for data valuation. For example, in (Ghorbani & Zou, 2019; Jiang et al., 2021), to examine the effect of individual data instances on training, one needs to train a model repeatedly while eliminating each instance or subset of instances. In (Swayamdipta et al., 2020; Toneva et al., 2019), on the other hand, training dynamics, i.e., the behavior of a model on each instance throughout training, is analyzed to categorize data instances. When the motivation for data valuation lies in finding a subset of data that can approximate full-dataset training, in order to save the computational cost of training, the previous valuation methods might not be suitable, since they already require training with the full dataset before one can identify 'important' instances.
In this paper, our main contribution is defining a training-free data valuation score, which can be computed directly from the data and can effectively quantify the impact of individual instances on the optimization and generalization of neural networks. The proposed score, called the complexity-gap score, measures the gap in data complexity when a certain data instance is removed from the full dataset. The data complexity measure was originally introduced in Arora et al. (2019) to quantify the complexity of the full dataset, and was used in bounding the generalization error of overparameterized
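To make the idea concrete, the following is a minimal sketch, not the paper's exact definition, of a complexity-gap computation. It assumes the data complexity measure of Arora et al. (2019) for two-layer ReLU networks, sqrt(2 y^T (H^inf)^{-1} y / n), where H^inf is the NTK Gram matrix of unit-norm inputs; the function names (`ntk_gram`, `data_complexity`, `complexity_gap`) are hypothetical and introduced here only for illustration.

```python
import numpy as np

def ntk_gram(X):
    # Gram matrix H^inf of the two-layer ReLU NTK (Arora et al., 2019):
    # H_ij = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi), for unit-norm rows of X.
    G = np.clip(X @ X.T, -1.0, 1.0)  # clip guards arccos against rounding
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def data_complexity(X, y):
    # Data complexity measure sqrt(2/n * y^T (H^inf)^{-1} y).
    n = len(y)
    H = ntk_gram(X)
    return np.sqrt(2.0 / n * y @ np.linalg.solve(H, y))

def complexity_gap(X, y, i):
    # Sketch of a complexity-gap score: the change in data complexity
    # when instance i is removed from the full dataset.
    mask = np.arange(len(y)) != i
    return data_complexity(X, y) - data_complexity(X[mask], y[mask])
```

Note that this requires no model training: each score is obtained from two linear solves against the kernel Gram matrix, one on the full dataset and one with the instance held out.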

Availability: Our code is publicly available at https://github.com/

