LAVA: DATA VALUATION WITHOUT PRE-SPECIFIED LEARNING ALGORITHMS

Abstract

Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis, and the choice of the learning algorithm is still undetermined at that stage. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computational burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between the training and the validation set. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over the state-of-the-art performance while being orders of magnitude faster.

1. INTRODUCTION

Advances in machine learning (ML) crucially rely on the availability of large, relevant, and high-quality datasets. However, real-world data sources often come in different sizes, relevance levels, and qualities, differing in their value for an ML task. Hence, a fundamental question is how to quantify the value of individual data sources. Data valuation has a wide range of use cases both within the domain of ML and beyond. It can help practitioners enhance model performance by prioritizing high-value data sources (Ghorbani & Zou, 2019), and it allows one to make strategic and economic decisions in data exchange (Scelta et al., 2019).

In the past literature (Ghorbani & Zou, 2019; Jia et al., 2019b; Kwon & Zou, 2021), data valuation is posed as a problem of equitably splitting the validation performance of a given learning algorithm among the training data. Formally, given a training dataset $D_t = \{z_i\}_{i=1}^{N}$, a validation dataset $D_v$, a learning algorithm $\mathcal{A}$, and a model performance metric $\mathrm{PERF}$ (e.g., classification accuracy), a utility function is first defined over all subsets $S \subseteq D_t$ of the training data: $U(S) := \mathrm{PERF}(\mathcal{A}(S))$. Then, the objective of data valuation is to find a score vector $s \in \mathbb{R}^N$ that represents the allocation to each datapoint. For instance, one simple way to value a point $z_i$ is through the leave-one-out (LOO) error $U(D_t) - U(D_t \setminus \{z_i\})$, i.e., the change in model performance when the point is excluded from training. Most recent works have leveraged concepts originating from cooperative game theory (CGT), such as the Shapley value (Ghorbani & Zou, 2019; Jia et al., 2019b), the Banzhaf value (Wang & Jia, 2022), general semivalues (Kwon & Zou, 2021), and least cores (Yan & Procaccia, 2021), to value data. Like the LOO error, all of these concepts are defined based on the utility function.

Since the utility function is defined w.r.t. a specific learning algorithm, the data values calculated from it also depend on the learning algorithm. In practice, there are many choice points pertaining to a learning algorithm, such as the model to be trained, the type of learning algorithm, and the hyperparameters. The detailed settings of the learning algorithm are often derived from data analysis. However, in many critical applications of data valuation, such as informing data acquisition priorities and designing data pricing mechanisms, data needs to be valued before the actual analysis, and the choice points of the learning algorithm are still undetermined at that time. This gap presents a main hurdle for deploying existing data valuation schemes in the real world.

The reliance on learning algorithms also makes existing data valuation schemes difficult to scale to large datasets. Exact evaluation of the LOO error and CGT-based data value notions requires evaluating the utility function over different subsets, and each evaluation entails retraining the model on that subset: the number of retrainings is linear in the number of data points for the former and exponential for the latter. While existing works have proposed a variety of approximation algorithms, scaling the calculation of these notions to large datasets remains expensive. Further, learning-algorithm-dependent approaches rely on the performance scores of models trained on different subsets to determine the value of data; thus, they are susceptible to noise due to training stochasticity when the learning algorithm is randomized (e.g., SGD) (Wang & Jia, 2022).
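To make the retraining cost of the LOO notion above concrete, the following minimal sketch computes LOO values by retraining a model with and without each point. The choice of logistic regression as the learning algorithm $\mathcal{A}$ and validation accuracy as $\mathrm{PERF}$ are illustrative assumptions, not prescribed by this paper.

```python
# Minimal sketch of leave-one-out (LOO) data valuation.
# Assumptions (not from the paper): a scikit-learn logistic regression
# stands in for the learning algorithm A, and validation accuracy
# stands in for PERF.
import numpy as np
from sklearn.linear_model import LogisticRegression

def utility(X_train, y_train, X_val, y_val):
    """U(S) := PERF(A(S)) with A = logistic regression, PERF = accuracy."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_val, y_val)

def loo_values(X_train, y_train, X_val, y_val):
    """s_i = U(D_t) - U(D_t without z_i); note this needs N retrainings."""
    full = utility(X_train, y_train, X_val, y_val)
    scores = np.empty(len(X_train))
    for i in range(len(X_train)):
        mask = np.arange(len(X_train)) != i
        scores[i] = full - utility(X_train[mask], y_train[mask], X_val, y_val)
    return scores
```

Even this simplest learning-algorithm-dependent notion retrains the model $N$ times; CGT-based notions such as the Shapley value require exponentially many subset evaluations in the worst case.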
This work addresses these limitations by introducing a learning-agnostic data valuation (LAVA) framework. LAVA is able to produce efficient and useful estimates of data value in a way that is oblivious to downstream learning algorithms. Our technical contributions are as follows.

Proxy for validation performance. We propose a proxy for the validation performance associated with a training set based on the non-conventional class-wise Wasserstein distance (Alvarez-Melis & Fusi, 2020) between the training and the validation set. This hierarchically-defined Wasserstein distance utilizes a hybrid Euclidean-Wasserstein cost function to compare feature-label pairs across datasets. We show that this distance characterizes the upper bound of the validation performance of any given model under certain Lipschitz conditions.

Sensitivity-analysis-based data valuation. We develop a method to assess the value of an individual training point by analyzing the sensitivity of this particular Wasserstein distance to perturbations of the corresponding probability mass. The values can be obtained for free from the output of off-the-shelf optimization solvers once the Wasserstein distance is computed. As the Wasserstein distance can be solved much more efficiently with entropy regularization (Cuturi, 2013), in our experiments we utilize the duals of the entropy-regularized program to approximate the sensitivity. Remarkably, we show that the gap between two data values under the original non-regularized Wasserstein distance can be recovered exactly from the solutions to the regularized program.

State-of-the-art performance for differentiating data quality. We evaluate LAVA over a wide range of use cases, including detecting mislabeled data, backdoor attacks, poisoning attacks, noisy features, and task-irrelevant data, some of which are studied for the first time in the data valuation setting. Our results show that, surprisingly, the learning-agnostic feature of our framework enables a significant performance improvement over existing methods, while being orders of magnitude faster.
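Although the class-wise distance and its sensitivity analysis are developed in later sections, the sketch below illustrates the general mechanism with the POT library: solve an OT problem between training and validation features, then read per-point scores off the solver's dual variables, with no model training involved. The uniform weights, plain squared-Euclidean cost, and the particular centering of the duals are simplifying assumptions for illustration, not the paper's exact construction.

```python
# Sketch: reading per-point values off an OT solver's dual variables.
# Assumptions (not spelled out in this section): uniform mass on both
# datasets and a squared-Euclidean feature cost; the paper instead uses
# a class-wise, label-aware cost.
import numpy as np
import ot  # POT: Python Optimal Transport

def dual_based_values(X_train, X_val):
    n, m = len(X_train), len(X_val)
    a = np.full(n, 1.0 / n)         # uniform mass on training points
    b = np.full(m, 1.0 / m)         # uniform mass on validation points
    M = ot.dist(X_train, X_val)     # pairwise squared-Euclidean costs
    # Exact OT; log['u'] holds the dual potentials on the training side.
    _, log = ot.emd(a, b, M, log=True)
    f = log['u']
    # Hypothetical calibration for illustration: center each dual against
    # the mean of the others, so scores reflect how the distance responds
    # to shifting probability mass onto each point.
    return f - (f.sum() - f) / (n - 1)
```

Note that the duals come out of a single OT solve, so valuing all $N$ points costs no more than computing the distance once.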

2. MEASURING DATASET UTILITY VIA OPTIMAL TRANSPORT

In this section, we consider the problem of quantifying training data utility $U(D_t)$ without knowledge of the learning algorithm. Similar to most existing data valuation frameworks, we assume access to a set of validation points $D_v$. Our idea is inspired by recent work on using a hierarchically-defined Wasserstein distance to characterize the relatedness of two datasets (Alvarez-Melis & Fusi, 2020). Our contribution here is to apply that particular Wasserstein distance to the data valuation problem and to provide a theoretical result that connects the distance to the validation performance of a model, which might be of independent interest.

2.1. OPTIMAL TRANSPORT-BASED DATASET DISTANCE

Background on Optimal Transport (OT). OT is a celebrated choice for measuring the discrepancy between probability distributions (Villani, 2009). Compared to other notable dissimilarity measures such as the Kullback-Leibler Divergence (Kullback & Leibler, 1951) or Maximum Mean Discrepancy (Gretton et al., 2012), the OT distance is a valid distance metric whenever its ground cost is, and it enjoys advantageous analytical properties.
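As a reference point for the discussion that follows, the standard Kantorovich formulation of OT between two distributions $\mu_t$ and $\mu_v$ can be written as below, where $\Pi(\mu_t, \mu_v)$ denotes the set of couplings with marginals $\mu_t$ and $\mu_v$ (standard notation, not specific to this paper):

```latex
% Kantorovich formulation of optimal transport: the cheapest coupling
% between \mu_t and \mu_v under a ground cost C(x, y).
\mathrm{OT}(\mu_t, \mu_v) :=
  \min_{\pi \in \Pi(\mu_t, \mu_v)}
  \int_{\mathcal{Z} \times \mathcal{Z}} \mathcal{C}(x, y)\, \mathrm{d}\pi(x, y)
```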

Code availability: https://github.com/ruoxi-jia-group/LAVA.

