LAVA: DATA VALUATION WITHOUT PRE-SPECIFIED LEARNING ALGORITHMS

Abstract

Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined then. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between the training and the validation set. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be directly obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over the state-of-the-art performance while being orders of magnitude faster.

1. INTRODUCTION

Advances in machine learning (ML) crucially rely on the availability of large, relevant, and high-quality datasets. However, real-world data sources often come in different sizes, relevance levels, and qualities, differing in their value for an ML task. Hence, a fundamental question is how to quantify the value of individual data sources. Data valuation has a wide range of use cases both within the domain of ML and beyond. It can help practitioners enhance model performance by prioritizing high-value data sources (Ghorbani & Zou, 2019), and it allows one to make strategic and economic decisions in data exchange (Scelta et al., 2019). In the past literature (Ghorbani & Zou, 2019; Jia et al., 2019b; Kwon & Zou, 2021), data valuation is posed as a problem of equitably splitting the validation performance of a given learning algorithm among the training data. Formally, given a training dataset D_t = {z_i}_{i=1}^N, a validation dataset D_v, a learning algorithm A, and a model performance metric PERF (e.g., classification accuracy), a utility function is first defined over all subsets S ⊆ D_t of the training data: U(S) := PERF(A(S)). Then, the objective of data valuation is to find a score vector s ∈ R^N that represents the allocation to each datapoint. For instance, one simple way to value a point z_i is through the leave-one-out (LOO) error U(D_t) − U(D_t \ {z_i}), i.e., the change of model performance when the point is excluded from training. Most of the recent works have leveraged concepts originating from cooperative game theory (CGT), such as the Shapley value (Ghorbani & Zou, 2019; Jia et al., 2019b) and the Banzhaf value (Wang & Jia, 2022).

Compared with other distribution discrepancy measures such as maximum mean discrepancy (MMD) (Szekely et al., 2005), the mathematically well-defined OT distance has advantageous analytical properties. For instance, OT is a distance metric, being computationally tractable and computable from finite samples (Genevay et al., 2018; Feydy et al., 2019).
The Kantorovich formulation (Kantorovich, 1942) defines the OT problem as a linear program (LP). Given probability measures µ_t, µ_v over the space Z, the OT problem is defined as

OT(µ_t, µ_v) := min_{π ∈ Π(µ_t, µ_v)} ∫_{Z²} C(z, z′) dπ(z, z′),

where Π(µ_t, µ_v) := {π ∈ P(Z × Z) | ∫_Z π(z, z′) dz′ = µ_t, ∫_Z π(z, z′) dz = µ_v} denotes the collection of couplings between the two distributions µ_t and µ_v, and C : Z × Z → R_+ is a symmetric positive cost function (with C(z, z) = 0). If C(z, z′) is the Euclidean distance between z and z′ according to the distance metric d, then OT(µ_t, µ_v) is the 2-Wasserstein distance, which we denote as W_C(µ_t, µ_v) = W_d(µ_t, µ_v) := OT(µ_t, µ_v). In this work, the notations OT and W are used interchangeably, with the slight difference that we use OT to emphasize its various formulations, while W specifies the distance metric on which it is computed.

Measuring Dataset Distance. We consider a multi-label setting, where f_t : X → {0, 1}^V and f_v : X → {0, 1}^V denote the labeling functions for training and validation data, respectively, and V is the number of different labels. Given the training set D_t = {(x_i, f_t(x_i))}_{i=1}^N of size N and the validation set D_v = {(x′_i, f_v(x′_i))}_{i=1}^M of size M, one can construct the discrete measures µ_t(x, y) := (1/N) Σ_{i=1}^N δ_{(x_i, y_i)} and µ_v(x, y) := (1/M) Σ_{i=1}^M δ_{(x′_i, y′_i)}, where δ is the Dirac delta function. Each datapoint consists of a feature-label pair (x_i, y_i) ∈ X × Y. While the Euclidean distance naturally provides a metric on features, a distance between labels generally lacks a definition. Consequently, we define the conditional distributions

µ_t(x|y) := µ_t(x)I[f_t(x) = y] / ∫ µ_t(x)I[f_t(x) = y] dx  and  µ_v(x|y) := µ_v(x)I[f_v(x) = y] / ∫ µ_v(x)I[f_v(x) = y] dx.
Inspired by Alvarez-Melis & Fusi (2020), we measure the distance between two labels in terms of the OT distance between the conditional distributions of the features given each label. Formally, we adopt the following cost function between feature-label pairs:

C((x_t, y_t), (x_v, y_v)) := d(x_t, x_v) + cW_d(µ_t(·|y_t), µ_v(·|y_v)),

where c ≥ 0 is a weight coefficient. We note that C is a distance metric since W_d is a valid distance metric. With this definition of C, we propose to measure the distance between the training and validation sets using the non-conventional, hierarchically-defined Wasserstein distance between the corresponding discrete measures:

W_C(µ_t, µ_v) = min_{π ∈ Π(µ_t, µ_v)} ∫_{Z²} C(z, z′) dπ(z, z′).

Despite its usefulness and potentially broad applications, existing research has neither explored the theoretical properties of this notion nor established applications upon it. This work aims to fill the gap in both directions: we present novel analytical results that provide theoretical justification, and we propose an original computing framework that extends its applications to a new scenario of datapoint valuation.

Computational Acceleration via Entropic Regularization. Solving the problem above scales cubically in MN, which is prohibitive for large datasets. Entropy-regularized OT (entropy-OT) has become a prevailing choice for approximating OT distances, as it admits the fastest-known algorithms. Using the iterative Sinkhorn algorithm (Cuturi, 2013), with almost linear time complexity and memory overhead, entropy-OT can be implemented at large scale with parallel computing (Genevay et al., 2018; Feydy et al., 2019). Given a regularization parameter ε > 0, entropy-OT is formulated as

OT_ε(µ_t, µ_v) := min_{π ∈ Π(µ_t, µ_v)} ∫_{Z²} C(z, z′) dπ(z, z′) + εH(π | µ_t ⊗ µ_v),

where H(π | µ_t ⊗ µ_v) = ∫_{Z²} log(dπ / dµ_t dµ_v) dπ.
As ε → 0, the dual solutions of ε-entropy-OT converge to their unregularized OT counterparts, as long as the latter are unique (Nutz & Wiesel, 2021).
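As a concrete illustration (ours, not from the paper), the Sinkhorn iteration for the entropy-regularized problem above alternately rescales a Gibbs kernel so that the coupling matches the two marginals. A minimal dense sketch in NumPy (the function name and toy sizes are our own; production code works in log-space for small ε):

```python
import numpy as np

def sinkhorn(C, mu_t, mu_v, eps=0.1, n_iters=2000):
    """Entropy-regularized OT (Cuturi, 2013): alternately rescale the Gibbs
    kernel K = exp(-C/eps) so the coupling matches both marginals."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu_t)
    for _ in range(n_iters):
        v = mu_v / (K.T @ u)   # fit column marginals
        u = mu_t / (K @ v)     # fit row marginals
    return u[:, None] * K * v[None, :]   # coupling pi_eps

rng = np.random.default_rng(0)
C = rng.random((5, 4))                   # toy cost matrix
mu_t = np.full(5, 1 / 5)
mu_v = np.full(4, 1 / 4)
pi = sinkhorn(C, mu_t, mu_v)
ot_eps = (pi * C).sum()                  # transport cost under pi_eps
```

Libraries such as POT and GeomLoss (Feydy et al., 2019) provide tuned, GPU-ready versions of this iteration.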

2.2. LOWER CLASS-WISE WASSERSTEIN DISTANCE ENTAILS BETTER VALIDATION PERFORMANCE

In this paper, we propose to use W_C, the non-conventional, class-wise Wasserstein distance w.r.t. the special distance function C defined in 2.1, as a learning-agnostic surrogate of validation performance to measure the utility of training data. Note that while Wasserstein distances have been frequently used to bound the learning performance change due to distribution drift (Courty et al., 2017; Damodaran et al., 2018; Shen et al., 2018; Ge et al., 2021), this paper is the first to bound the performance change by the hierarchically-defined Wasserstein distance with respect to the hybrid cost C. Figure 1 provides an empirical justification for using this novel distance metric as a proxy, presenting the relation between the class-wise Wasserstein distance and a model's validation performance. Each curve represents a certain dataset trained on a specific model. Since each dataset is of different size and structure, the distances are on different scales; we therefore normalize them to a common scale. The figure shows that, across different datasets and models, validation performance decreases as the distance increases.

Theorem 1. Let f : X → [0, 1]^V be the model trained on the training data and assume f is ϵ-Lipschitz. Assume the loss function L : {0, 1}^V × [0, 1]^V → R_+ is k-Lipschitz in both inputs. Define the cost function C between (x_v, y_v) and (x_t, y_t) as C((x_t, y_t), (x_v, y_v)) := d(x_t, x_v) + cW_d(µ_t(·|y_t), µ_v(·|y_v)), where c is a constant. Under a certain (ϵ_tv, δ_tv)-cross-Lipschitzness assumption for f_t and f_v detailed in Appendix A, we have

E_{x∼µ_v(x)}[L(f_v(x), f(x))] ≤ E_{x∼µ_t(x)}[L(f_t(x), f(x))] + kϵW_C(µ_t, µ_v) + O(kV δ_tv).

Proofs are deferred to Appendix A. The bound is interesting to interpret. The first term on the right-hand side corresponds to the training performance. In practice, when a model with large enough capacity is used, this term is small.
The second term is the exact expression of the Wasserstein distance that we propose to use as a proxy for validation performance. The last error term is due to possible violation of the cross-Lipschitzness assumption for f_t and f_v; it will be small if f_t and f_v assign the same label to close features with high probability. If the last term is small enough, the proposed Wasserstein distance can serve as a proxy for the validation loss, provided that f, f_t, and f_v satisfy the corresponding Lipschitzness assumptions. The bound resonates with the empirical observation in Figure 1 that the validation loss of the trained model decreases as the distance between the training and the validation data gets lower.
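To make the surrogate concrete, the following self-contained sketch (our own illustration; `exact_ot` and `classwise_wasserstein` are hypothetical helper names) computes the class-wise Wasserstein distance W_C on toy data: the label-to-label cost is itself an OT distance between class-conditional feature distributions, and the overall distance is an OT problem under the hybrid cost C. LAVA additionally uses learned feature embeddings and entropic solvers, which are omitted here.

```python
import numpy as np
from scipy.optimize import linprog

def exact_ot(C, a, b):
    """Exact OT value between discrete measures a, b under cost matrix C."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1        # row marginals = a
    for j in range(m):
        A_eq[n + j, j::m] = 1                 # column marginals = b
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

def classwise_wasserstein(xt, yt, xv, yv, c=1.0):
    """W_C with point-to-point cost d(x_t, x_v) + c * W_d(mu_t(.|y_t), mu_v(.|y_v))."""
    uniform = lambda n: np.full(n, 1.0 / n)
    # OT distance between class-conditional feature distributions per label pair
    label_dist = {}
    for lt in np.unique(yt):
        for lv in np.unique(yv):
            Xt, Xv = xt[yt == lt], xv[yv == lv]
            D = np.linalg.norm(Xt[:, None] - Xv[None, :], axis=-1)
            label_dist[lt, lv] = exact_ot(D, uniform(len(Xt)), uniform(len(Xv)))
    # hybrid cost: feature distance + weighted label distance
    C = np.linalg.norm(xt[:, None] - xv[None, :], axis=-1)
    C = C + c * np.array([[label_dist[yt[i], yv[j]]
                           for j in range(len(xv))] for i in range(len(xt))])
    return exact_ot(C, uniform(len(xt)), uniform(len(xv)))

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.7, (8, 2)), rng.normal(4, 0.7, (8, 2))])
y = np.array([0] * 8 + [1] * 8)
y_noisy = y.copy()
y_noisy[:4] = 1                                 # mislabel half of class 0

d_same = classwise_wasserstein(x, y, x, y)      # identical sets: distance ~ 0
d_noisy = classwise_wasserstein(x, y_noisy, x, y)  # label noise inflates W_C
```

Consistent with the theorem above, corrupting the training labels increases the class-wise distance to the clean validation set.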

3. EFFICIENT VALUATION OF INDIVIDUAL DATAPOINTS

Note that the class-wise Wasserstein distance defined in the previous section can be used to measure the utility of subsets of D_t. Given this utility function, one can potentially use existing CGT-based notions such as the Shapley value to measure the contribution of individual points. However, even approximating these notions requires evaluating the utility function on a large number of subsets, which incurs large extra computation costs. In this section, we introduce a new approach to valuing individual points. Remarkably, our values can be obtained directly, for free, from the output of off-the-shelf optimization solvers once the proposed Wasserstein distance between the full training and validation sets is computed.

3.1. DATAPOINT VALUATION VIA PARAMETER SENSITIVITY

OT distance is known to be insensitive to small differences while also being not robust to large deviations (Villani, 2021). This feature makes it naturally suitable for detecting abnormal datapoints: it disregards normal variations in distances between clean data while remaining sensitive to the abnormal distances of outlying points. We propose to measure each datapoint's contribution based on the gradient of the OT distance with respect to perturbations of the probability mass associated with that point. Gradients are local information. However, unlike widely used influence functions, which only hold for infinitesimal perturbations (Koh & Liang, 2017), gradients for an LP hold exactly within a local range and still encode partial information beyond that range. This makes them capable of reliably predicting the change in the OT distance due to adding or removing datapoints without re-computation. Moreover, the gradients are directed information, revealing both positive and negative contributions for each datapoint and allowing one to rank datapoints by their gradient values. Finally, the OT distance always accounts for the collective effect of all datapoints in the dataset.

Leveraging the duality theorem for LP, we rewrite the original OT problem (introduced in 2.1) in the equivalent form

OT(µ_t, µ_v) := max_{(f, g) ∈ C⁰(Z)²} ⟨f, µ_t⟩ + ⟨g, µ_v⟩  subject to  f(z) + g(z′) ≤ C(z, z′),

where C⁰(Z) is the set of all continuous functions and f, g are the dual variables. Let π* and (f*, g*) be the corresponding optimal solutions to the primal and dual problems. The Strong Duality Theorem indicates that OT(µ_t, µ_v) = ⟨f*, µ_t⟩ + ⟨g*, µ_v⟩, where the right-hand side is the dual objective parameterized by µ_t and µ_v. From the Sensitivity Theorem (Bertsekas, 1997), the gradient of the distance w.r.t. the probability mass of datapoints in the two datasets can be expressed as

∇_{µ_t} OT(µ_t, µ_v) = (f*)ᵀ,  ∇_{µ_v} OT(µ_t, µ_v) = (g*)ᵀ.
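To see strong duality concretely on a discrete instance (an illustration of ours, not from the paper), one can solve the primal and dual LPs independently and check that their optimal values coincide; the sketch below uses SciPy's HiGHS-based `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, M = 6, 5
C = rng.random((N, M))            # toy cost matrix
mu_t = np.full(N, 1 / N)
mu_v = np.full(M, 1 / M)

# Primal LP: min <C, pi> subject to the coupling (marginal) constraints.
A_eq = np.zeros((N + M, N * M))
for i in range(N):
    A_eq[i, i * M:(i + 1) * M] = 1        # row sums = mu_t
for j in range(M):
    A_eq[N + j, j::M] = 1                 # column sums = mu_v
primal = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu_t, mu_v]),
                 bounds=(0, None), method="highs")

# Dual LP: max <f, mu_t> + <g, mu_v> s.t. f_i + g_j <= C_ij,
# written as a minimization of the negated objective.
A_ub = np.zeros((N * M, N + M))
rows = np.arange(N * M)
A_ub[rows, rows // M] = 1                 # coefficient of f_i
A_ub[rows, N + rows % M] = 1              # coefficient of g_j
dual = linprog(-np.concatenate([mu_t, mu_v]), A_ub=A_ub, b_ub=C.ravel(),
               bounds=(None, None), method="highs")

primal_value = primal.fun
dual_value = -dual.fun                    # strong duality: equals the primal value
```

In practice, LP solvers expose the dual solution directly alongside the primal one, which is what makes the valuation "free" once the distance has been computed.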
Note that the constraints of the original formulation in 2.1 are always redundant: since Σ_{i=1}^N µ_t(z_i) = Σ_{i=1}^M µ_v(z′_i) = 1, one of the marginal constraints is implied by the others, rendering the dual solution non-unique. To address this issue, we first remove any one of the constraints in Π(µ_t, µ_v), making the primal formulation non-degenerate, and then assign a value of zero to the dual variable corresponding to the removed primal constraint. When measuring the gradients of the OT distance w.r.t. the probability mass of a given datapoint in each dataset, we calculate the calibrated gradient as

∂OT(µ_t, µ_v)/∂µ_t(z_i) = f*_i − Σ_{j∈{1,...,N}\{i}} f*_j / (N − 1),
∂OT(µ_t, µ_v)/∂µ_v(z′_i) = g*_i − Σ_{j∈{1,...,M}\{i}} g*_j / (M − 1),

which represents the rate of change of the OT distance w.r.t. the probability mass of a given datapoint, along the direction that keeps the total probability mass of the dataset equal to one (explicitly enforcing the removed constraint). The value of the calibrated gradients is independent of which constraint is removed.

Datapoint valuation via calibrated gradients. The calibrated gradients predict how the OT distance changes as more probability mass is shifted to a given datapoint. This can be interpreted as a measure of the datapoint's contribution to the OT distance. The contribution can be positive or negative: shifting more probability mass to the datapoint would respectively increase or decrease the dataset distance. If we want a training set to match the distribution of the validation set, then removing datapoints with large positive gradients while upweighting datapoints with large negative gradients can be expected to reduce the OT distance. As we will show later, the calibrated gradients provide a tool to detect abnormal or irrelevant data in various applications.

Radius for accurate predictions.
Linear programming theory (Bertsimas & Tsitsiklis, 1997) guarantees that, for each non-degenerate optimal solution, the parameters on the right-hand side of the primal constraints (Π(µ_t, µ_v) in 2.1) can be perturbed within a small range without affecting the optimal solution of the dual problem. When the perturbation goes beyond a certain range, the dual solution becomes primal infeasible and the optimization problem needs to be solved again. Hence, the calibrated gradients are local information, and we would like to know the perturbation radius within which the optimal dual solution remains unchanged, i.e., whether this range is large enough for the calibrated gradients to accurately predict the actual change in the OT distance. If the perturbation exceeds this range, the prediction may become inaccurate, as the dual solution only encodes partial information about the optimization. In our evaluation, we find that this range is about 5% to 25% of the probability mass of a datapoint (µ_(·)(z_i)) for perturbations in both directions, and the pattern appears independent of dataset size. Since this range is smaller than the probability mass of a datapoint, we can only approximately predict the change in the OT distance caused by removing or adding a datapoint; the relative error, however, is acceptable (depicted in Figure 2).
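The calibrated gradients can be sketched end-to-end as follows (our own illustration; `dual_potentials` and `calibrated_gradients` are hypothetical helper names). Here we recover dual potentials by explicitly solving the dual LP and then apply the calibration above; note that the calibration makes the values invariant to the additive constant that renders the raw duals non-unique:

```python
import numpy as np
from scipy.optimize import linprog

def dual_potentials(C, mu_t, mu_v):
    """Solve the dual OT LP: max <f, mu_t> + <g, mu_v> s.t. f_i + g_j <= C_ij."""
    N, M = C.shape
    A_ub = np.zeros((N * M, N + M))
    rows = np.arange(N * M)
    A_ub[rows, rows // M] = 1          # coefficient of f_i
    A_ub[rows, N + rows % M] = 1       # coefficient of g_j
    res = linprog(-np.concatenate([mu_t, mu_v]), A_ub=A_ub, b_ub=C.ravel(),
                  bounds=(None, None), method="highs")
    return res.x[:N], res.x[N:]

def calibrated_gradients(C, mu_t, mu_v):
    f, g = dual_potentials(C, mu_t, mu_v)
    N, M = len(f), len(g)
    # f_i minus the mean of the other potentials (likewise for g):
    # the arbitrary additive shift in (f, g) cancels out here.
    cal_t = f - (f.sum() - f) / (N - 1)
    cal_v = g - (g.sum() - g) / (M - 1)
    return cal_t, cal_v

# Toy check: a training point far from the validation data should receive
# an extreme calibrated gradient.
rng = np.random.default_rng(2)
xt = np.vstack([rng.normal(0, 0.5, (5, 2)), [[10.0, 10.0]]])  # last point: outlier
xv = rng.normal(0, 0.5, (4, 2))
C = np.linalg.norm(xt[:, None] - xv[None, :], axis=-1)
cal_t, cal_v = calibrated_gradients(C, np.full(6, 1 / 6), np.full(4, 1 / 4))
```

In this toy run the outlier gets the largest positive calibrated gradient, matching the interpretation that removing points with large positive gradients brings the training set closer to the validation distribution. The calibrated gradients within each dataset also sum to zero by construction.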

3.2. PRECISE RECOVERY OF RANKING FOR DATA VALUES OBTAINED FROM ENTROPY-OT

Due to the computational advantages of entropy-OT (defined in 2.1), one needs to resort to the solutions of entropy-OT to calculate data values. We quantify the deviation in the calibrated gradients caused by the entropy regularizer. This analysis provides a foundation for assessing the potential impact of the deviation on the applications built upon these gradients.

Theorem 2. Let OT(µ_t, µ_v) and OT_ε(µ_t, µ_v) be the original and the entropy-penalized formulations (as defined in 2.1) of the OT problem between the empirical measures µ_t and µ_v. Then, for any i ≠ j ≠ k ∈ {1, 2, ..., N} and o ≠ p ≠ q ∈ {1, 2, ..., M}, the difference between the calibrated gradients for two datapoints z_i and z_k in dataset D_t, and the difference for z′_p and z′_q in D_v, can be calculated as

∂OT(µ_t, µ_v)/∂µ_t(z_i) − ∂OT(µ_t, µ_v)/∂µ_t(z_k) = ∂OT_ε(µ_t, µ_v)/∂µ_t(z_i) − ∂OT_ε(µ_t, µ_v)/∂µ_t(z_k) − ε · (N/(N − 1)) · (1/(π*_ε)_{kj} − 1/(π*_ε)_{ij}),

∂OT(µ_t, µ_v)/∂µ_v(z′_p) − ∂OT(µ_t, µ_v)/∂µ_v(z′_q) = ∂OT_ε(µ_t, µ_v)/∂µ_v(z′_p) − ∂OT_ε(µ_t, µ_v)/∂µ_v(z′_q) − ε · (M/(M − 1)) · (1/(π*_ε)_{oq} − 1/(π*_ε)_{op}),

where π*_ε is the optimal primal solution to the entropy-penalized OT problem defined in 2.1, z_j is any datapoint in D_t other than z_i or z_k, and z′_o is any datapoint in D_v other than z′_p or z′_q.

(9) G-Shapley (G-SV) (Ghorbani & Zou, 2019). Baselines (6)-(9) are, however, computationally infeasible for the scale of data that we study here, so we exclude them from the evaluation of efficacy in different use cases. We also provide a detailed runtime comparison of all baselines. For all methods compared, a validation set of 10,000 samples is assumed. For our method, we first use the validation data to train a deep neural network model, PreActResNet18 (He et al., 2016), from scratch for feature extraction. Then, from its output, we compute the class-wise Wasserstein distance and the calibrated gradients for data valuation.
Details about datasets, models, hyperparameter settings, and ablation studies of the hyperparameters and validation sizes are provided in Appendix B. We evaluate on five different use cases of data valuation: detecting backdoor attacks, poisoning attacks, noisy features, mislabeled data, and irrelevant data. The first four are conventional tasks in the literature, and the last one is a new case. All of them share the common goal of identifying "low-quality" training points. To achieve this goal, we rank datapoints in ascending order of their values and remove some number of points with the lowest data values. For each removal budget, we calculate the detection rate, i.e., the percentage of truly bad points among the removed points.

Backdoor Attack Detection. A popular technique for introducing backdoors to models is injecting maliciously constructed data into the training set (Zeng et al., 2021). At test time, any trained model would misclassify inputs patched with the backdoor trigger as the adversarially-desired target class. In the main text, we consider the Trojan Square attack, a popular attack algorithm (Liu et al., 2017), which injects training points that contain a backdoor trigger and are relabeled as a target class. The evaluation of other types of backdoor attacks can be found in Appendix B. To simulate this attack, we select Airplane as the target class and poison 2,500 (5%) samples of the CIFAR-10 training set (50K) with a square trigger. In Figure 3 I.(a), we compare the detection rates of different data valuation methods. LAVA and TracIn-Clean outperform the others by a large margin. In particular, for LAVA, the first 20% of the points it removes contain at least 80% of the poisoned data. We also evaluate whether the model trained after the removal still suffers from the backdoor vulnerability.
To perform this evaluation, we calculate the attack accuracy, i.e., the accuracy with which the model trained on the remaining points predicts backdoored examples as the target label. A successful data removal yields a lower attack accuracy. Figure 3 I.(b) shows that our method already takes effect in the early stages, whereas the other baselines only start defending against the attack after removing over 13,000 samples. The efficacy of LAVA is in part attributable to its inspection of distances between both features and labels. The backdoored training samples that are poisoned toward the target class are "unnatural" in that class, i.e., they have a large feature distance from the original samples of the target class. And while the poisoned examples differ only by a small feature perturbation from natural examples of some other class, their label distance to those examples is large because their labels have been altered.

Poisoning Attack Detection. Poisoning attacks are similar to backdoor attacks in that both inject adversarial points into the training set to manipulate the prediction of certain test examples. Unlike backdoor attacks, however, poisoning attacks cannot control test examples. We consider a popular attack termed the "feature-collision" attack (Shafahi et al., 2018): we select a target sample from the Cat test set and blend the selected image into training samples of the chosen target class, Frog in our case. In this attack, we do not modify labels and blend the Cat image into only 50 (0.1%) Frog samples, which makes the attack especially hard to detect. At inference time, we expect the attacked model to consistently classify the chosen Cat as a Frog. In Figure 3 II.(a), we observe that LAVA outperforms all baselines and achieves an 80% detection rate by removing only 11K samples, around 60% fewer than the best-performing baseline.

Computational Efficiency.
So far, we have focused on detection performance without considering the actual runtime. We compare the runtime-performance tradeoff on a CIFAR-10 example of 2,000 samples with 10% backdoor data, a scale at which every baseline can be executed in a reasonable time. As shown in Figure 5, our method achieves a significant improvement in efficiency while detecting bad data more effectively.

Dependence on Validation Data Size. The experiments so far have assumed a validation set of size 10K. Data at this scale is not hard to acquire, as one can obtain high-quality data from crowdsourcing platforms such as Amazon Mechanical Turk for $12 per 1K samples (AWS, 2019). While our method achieves remarkable performance with 10K validation samples, we also perform an ablation study on much smaller validation sets (Appendix B.2.1), where LAVA, notably, still outperforms the other baselines. As an example, on mislabeled data detection, our method with 2K validation samples achieves an 80% detection rate at a removal budget of 25K (Fig. 9), whereas the best-performing baseline needs a validation set five times larger, 10K, for comparable performance (Fig. 3 IV.(a)). Furthermore, even on a tiny validation set of 500 samples, LAVA consistently outperforms all baselines given the same validation size (Fig. 11). This shows that our method remains effective across various validation data sizes.

While it can be thought of as a learning-agnostic data valuation method, it is not as effective or efficient as our method in distinguishing data quality. Xu et al. (2021) propose to use volume to measure the utility of a dataset. Volume is agnostic to learning algorithms and easy to calculate, as it is defined simply as the square root of the determinant of the feature matrix's Gram matrix. However, its sole dependence on features makes it incapable of detecting bad data caused by labeling errors.
Moreover, to evaluate the contribution of individual points, the authors propose to resort to the Shapley value, which would still be expensive for large datasets.
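A tiny sketch (ours, under the determinant-based definition attributed to Xu et al. (2021)) makes the stated limitation concrete: labels never enter the volume computation, so label corruption is invisible to it.

```python
import numpy as np

def volume(X):
    """Volume of a feature matrix X: sqrt(det(X^T X)), i.e., the square root
    of the determinant of the Gram matrix of the features."""
    return np.sqrt(np.linalg.det(X.T @ X))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))                 # features of 50 datapoints
y_clean = rng.integers(0, 2, size=50)        # labels (unused by volume)
y_noisy = 1 - y_clean                        # every label flipped

# volume(X) is the same for (X, y_clean) and (X, y_noisy):
# the measure depends on features only.
vol_clean = volume(X)
vol_noisy = volume(X)
```

By contrast, the class-wise Wasserstein distance couples features with labels through the hybrid cost C, which is why it can react to mislabeling.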

6. DISCUSSION AND OUTLOOK

This paper describes a learning-agnostic data valuation framework. In contrast to existing methods, which typically adopt model validation performance as the utility function, we approximate the utility of a dataset by its class-wise Wasserstein distance to a given validation set and provide theoretical justification for this approximation. Furthermore, we propose to use the calibrated gradients of the OT distance to value individual datapoints, which can be obtained for free if one uses an off-the-shelf solver to calculate the Wasserstein distance. Importantly, we have tested our LAVA framework on various datasets, and it significantly improves over the state-of-the-art performance of data valuation methods for detecting bad data while being substantially more efficient. Because of the stochasticity of ML training and models' inherent tolerance to noise, it is often challenging to identify low-quality data by inspecting their influence on model performance scores. The take-away from our empirical study is that, despite being extensively adopted in the past, low-quality data detection through model performance changes is actually suboptimal; lifting the dependence of data valuation on the actual learning process provides a better pathway to distinguishing data quality. Despite the performance and efficiency improvements, our work still has some limitations, which open up many new avenues for investigation: (1) How can the dependence on validation data be further reduced? While a validation set representative of the downstream learning task is a common assumption in the ML literature, it may or may not be available during data exchange. (2) Our design could be vulnerable to existing poisons that directly or indirectly minimize similarity to clean data (Huang et al., 2021; Pan et al., 2022). Further investigation into robust data valuation would be intriguing.
(3) Our current method does not have enough flexibility for tasks that aim at goals beyond accuracy, e.g., fairness. Folding other learning goals into the framework is an exciting direction. (4) Customizing the framework to natural language data is also of practical interest.

Intuitively, given labeling functions f_t, f_v and a coupling π, we can bound the probability of finding pairs of training and validation instances labeled differently within a (1/ϵ)-ball with respect to π.

Our Assumptions. Assume that f is an ϵ-Lipschitz function. Given a metric function d(·, ·), we define a cost function C between (x_t, y_t) and (x_v, y_v) as C((x_t, y_t), (x_v, y_v)) := d(x_t, x_v) + cW_d(µ_t(·|y_t), µ_v(·|y_v)), where c is a constant. Let π*_{x,y} be the coupling between µ^{f_t}_t and µ^{f_v}_v such that

π*_{x,y} := arg inf_{π ∈ Π(µ^{f_t}_t, µ^{f_v}_v)} E_{((x_t,y_t),(x_v,y_v))∼π}[C((x_t, y_t), (x_v, y_v))].

We define two couplings π* and π̃* between µ_t(x) and µ_v(x) as follows:

π*(x_t, x_v) := ∫_Y ∫_Y π*_{x,y}((x_t, y_t), (x_v, y_v)) dy_t dy_v.

For π̃*, we first define a coupling between µ^{f_t} and µ^{f_v}:

π*_y(y_t, y_v) := ∫_X ∫_X π*_{x,y}((x_t, y_t), (x_v, y_v)) dx_t dx_v,   (8)

and another coupling between µ^{f_t}_t and µ^{f_v}_v:

π̃*_{x,y}((x_t, y_t), (x_v, y_v)) := π*_y(y_t, y_v) µ_t(x_t|y_t) µ_v(x_v|y_v).

Finally, π̃* is constructed as

π̃*(x_t, x_v) := ∫_Y ∫_Y π*_y(y_t, y_v) µ_t(x_t|y_t) µ_v(x_v|y_v) dy_t dy_v.

It is easy to see that all joint distributions defined above are couplings between the corresponding pairs of distributions. We assume that f_t, f_v are (ϵ_tv, δ_tv)-probabilistic cross-Lipschitz with respect to π̃*_{x,y} in metric d. Additionally, we assume that ϵ_tv/ϵ ≤ c and that the loss function L is k-Lipschitz in both inputs. Besides, from their definitions above, we have ∥f(x)∥, ∥f_t(x)∥, ∥f_v(x)∥ ≤ V.
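For reference, the notion of probabilistic cross-Lipschitzness used above can be stated as follows (our restatement, consistent with how it is applied in the proof):

```latex
\textbf{Definition (probabilistic cross-Lipschitzness).} Two labeling functions
$f_t, f_v : \mathcal{X} \to \{0,1\}^V$ are
$(\epsilon_{tv}, \delta_{tv})$-probabilistic cross-Lipschitz with respect to a
coupling $\pi$ over $\mathcal{X} \times \mathcal{X}$ in metric $d$ if
\[
  \Pr_{(x_t, x_v) \sim \pi}\big[\, \|f_t(x_t) - f_v(x_v)\| > \epsilon_{tv}\, d(x_t, x_v) \,\big] \;\le\; \delta_{tv}.
\]
```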
The assumption of probabilistic cross-Lipschitzness would be violated only if the underlying coupling assigned large probability to pairs of training-validation features that are close (within a (1/ϵ_tv)-ball) but labeled differently. However, π̃* is generally not such a coupling. Note that π*_{x,y} is the optimal coupling between the training and validation distributions that minimizes a cost function C pertaining to both the feature and label spaces. Hence π*_y(y_t, y_v), the marginal of π*_{x,y} over the training and validation label spaces, tends to assign high probability to label pairs that agree. On the other hand, π̃*_{x,y} can be thought of as a coupling that first generates training-validation label pairs from π*_y and then generates the features in each dataset conditioned on the corresponding labels. Hence, the marginal distribution π̃* of the training-validation feature pairs generated by π̃*_{x,y} assigns high likelihood to feature pairs with the same labels. So, conceptually, the probabilistic cross-Lipschitzness assumption should be easily satisfied by π̃*_{x,y}.

A.3 DETAILED PROOF

Theorem 1 (restated). Given the above assumptions, we have

E_{x∼µ_v(x)}[L(f_v(x), f(x))] ≤ E_{x∼µ_t(x)}[L(f_t(x), f(x))] + kϵW_C(µ^{f_t}_t, µ^{f_v}_v) + 2kV δ_tv.

Proof.

E_{x∼µ_v(x)}[L(f_v(x), f(x))]   (12)
= E_{x∼µ_v(x)}[L(f_v(x), f(x))] − E_{x∼µ_t(x)}[L(f_t(x), f(x))] + E_{x∼µ_t(x)}[L(f_t(x), f(x))]   (13)
≤ E_{x∼µ_t(x)}[L(f_t(x), f(x))] + |E_{x∼µ_v(x)}[L(f_v(x), f(x))] − E_{x∼µ_t(x)}[L(f_t(x), f(x))]|.   (14)

We bound |E_{x∼µ_v(x)}[L(f_v(x), f(x))] − E_{x∼µ_t(x)}[L(f_t(x), f(x))]| as follows:

|E_{x∼µ_v(x)}[L(f_v(x), f(x))] − E_{x∼µ_t(x)}[L(f_t(x), f(x))]|   (15)
= |∫_{X²} [L(f_v(x_v), f(x_v)) − L(f_t(x_t), f(x_t))] dπ*(x_t, x_v)|   (16)
= |∫_{X²} [L(f_v(x_v), f(x_v)) − L(f_v(x_v), f(x_t)) + L(f_v(x_v), f(x_t)) − L(f_t(x_t), f(x_t))] dπ*(x_t, x_v)|   (17)
≤ ∫_{X²} |L(f_v(x_v), f(x_v)) − L(f_v(x_v), f(x_t))| dπ*(x_t, x_v)   [=: U_1]   (18)
  + ∫_{X²} |L(f_v(x_v), f(x_t)) − L(f_t(x_t), f(x_t))| dπ*(x_t, x_v)   [=: U_2],   (19)

where the last inequality is due to the triangle inequality. Now, we bound U_1 and U_2 separately. For U_1, we have

U_1 ≤ k ∫_{X²} ∥f(x_v) − f(x_t)∥ dπ*(x_t, x_v)   (20)
    ≤ kϵ ∫_{X²} d(x_t, x_v) dπ*(x_t, x_v),   (21)

where the two inequalities are due to the Lipschitzness of L and f, respectively. To bound U_2, we first recall that π*_y(y_t, y_v) = ∫_X ∫_X π*_{x,y}((x_t, y_t), (x_v, y_v)) dx_t dx_v and π̃*_{x,y}((x_t, y_t), (x_v, y_v)) := π*_y(y_t, y_v)µ_t(x_t|y_t)µ_v(x_v|y_v). Observe that

U_2 = ∫_{X²} ∫_{Y²} |L(f_v(x_v), f(x_t)) − L(f_t(x_t), f(x_t))| dπ*_{x,y}((x_t, y_t), (x_v, y_v))   (22)
    = ∫_{Y²} ∫_{X²} |L(y_v, f(x_t)) − L(y_t, f(x_t))| dπ*_{x,y}((x_t, y_t), (x_v, y_v))   (23)
    ≤ k ∫_{Y²} ∫_{X²} ∥y_v − y_t∥ dπ*_{x,y}((x_t, y_t), (x_v, y_v))   (24)
    = k ∫_{Y²} ∥y_v − y_t∥ dπ*_y(y_t, y_v),   (25)

where the second equality holds because π*_{x,y}((x_t, y_t), (x_v, y_v)) = 0 whenever y_t ≠ f_t(x_t) or y_v ≠ f_v(x_v).
Now we can bound U_2 as follows:

U_2 ≤ k ∫_{Y²} ∥y_v − y_t∥ dπ*_y(y_t, y_v)   (26)
    = k ∫_{Y²} ∫_{X²} ∥y_v − y_t∥ dπ̃*_{x,y}((x_t, y_t), (x_v, y_v))   (27)
    = k ∫_{Y²} ∫_{X²} ∥f_v(x_v) − f_t(x_t)∥ dπ̃*_{x,y}((x_t, y_t), (x_v, y_v)),   (28)

where the last step holds since π̃*_{x,y}((x_t, y_t), (x_v, y_v)) = 0 if y_t ≠ f_t(x_t) or y_v ≠ f_v(x_v). Define the region A = {(x_t, x_v) : ∥f_v(x_v) − f_t(x_t)∥ < ϵ_tv d(x_t, x_v)}; then

k ∫_{Y²} ∫_{X²} ∥f_v(x_v) − f_t(x_t)∥ dπ̃*_{x,y}((x_t, y_t), (x_v, y_v))   (29)
= k ∫_{Y²} ∫_{X²\A} ∥f_v(x_v) − f_t(x_t)∥ dπ̃*_{x,y}((x_t, y_t), (x_v, y_v))   (30)
  + k ∫_{Y²} ∫_A ∥f_v(x_v) − f_t(x_t)∥ dπ̃*_{x,y}((x_t, y_t), (x_v, y_v))   (31)
≤ k ∫_{Y²} ∫_{X²\A} 2V dπ̃*_{x,y}((x_t, y_t), (x_v, y_v))   (32)
  + k ∫_{Y²} ∫_A ∥f_v(x_v) − f_t(x_t)∥ dπ̃*_{x,y}((x_t, y_t), (x_v, y_v)).   (33)

Let us define f̃_t(x_t) = f_t(x_t) and f̃_v(x_v) = f_v(x_v) if (x_t, x_v) ∈ A, and f̃_t(x_t) = f̃_v(x_v) = 0 otherwise (note that ∥f̃_v(x_v) − f̃_t(x_t)∥ ≤ ϵ_tv d(x_t, x_v) for all (x_t, x_v) ∈ X²). Then we can bound the second term as follows:

k ∫_{Y²} ∫_A ∥f_v(x_v) − f_t(x_t)∥ dπ̃*_{x,y}((x_t, y_t), (x_v, y_v))   (34)
≤ k ∫_{Y²} dπ*_y(y_t, y_v) ∫_A ∥f_v(x_v) − f_t(x_t)∥ dµ_t(x_t|y_t) dµ_v(x_v|y_v)   (35)
= k ∫_{Y²} dπ*_y(y_t, y_v) ∥∫_{X²} (f̃_v(x_v) − f̃_t(x_t)) dµ_t(x_t|y_t) dµ_v(x_v|y_v)∥   (36)
= k ∫_{Y²} dπ*_y(y_t, y_v) ∥E_{x_v∼µ_v(·|y_v)}[f̃_v(x_v)] − E_{x_t∼µ_t(·|y_t)}[f̃_t(x_t)]∥   (37)
≤ kϵ_tv ∫_{Y²} dπ*_y(y_t, y_v) W_d(µ_t(·|y_t), µ_v(·|y_v)).   (38)

Inequality (38) is a consequence of the duality form of the Kantorovich-Rubinstein theorem (Villani (2021), Chapter 1). Combining the two parts, we have

U_2 ≤ k ∫_{Y²} ∫_{X²\A} 2V dπ̃*_{x,y}((x_t, y_t), (x_v, y_v)) + kϵ_tv ∫_{Y²} dπ*_y(y_t, y_v) W_d(µ_t(·|y_t), µ_v(·|y_v))
    ≤ 2kV δ_tv + kϵ_tv ∫_{Y²} dπ*_y(y_t, y_v) W_d(µ_t(·|y_t), µ_v(·|y_v)),

where the last step is due to the probabilistic cross-Lipschitzness of f_t, f_v with respect to π̃*_{x,y}.
Now, combining the bounds for $U_1$ and $U_2$, we have
\begin{align*}
&\mathbb{E}_{x\sim\mu_v(x)}[\mathcal{L}(f_v(x), f(x))] - \mathbb{E}_{x\sim\mu_t(x)}[\mathcal{L}(f_t(x), f(x))] \\
&\le k\epsilon \int_{\mathcal{X}^2} d(x_t, x_v)\, d\pi^*(x_t, x_v) + 2kV\delta_{tv} + k\epsilon_{tv} \int_{\mathcal{Y}^2} d\pi^*_y(y_t, y_v)\, \mathcal{W}_d(\mu_t(\cdot|y_t), \mu_v(\cdot|y_v)) \\
&= k \int_{(\mathcal{X}\times\mathcal{Y})^2} \left[\epsilon\, d(x_t, x_v) + \epsilon_{tv}\, \mathcal{W}_d(\mu_t(\cdot|y_t), \mu_v(\cdot|y_v))\right] d\pi^*_{x,y}((x_t, y_t), (x_v, y_v)) + 2kV\delta_{tv} \\
&\le k \int_{(\mathcal{X}\times\mathcal{Y})^2} \left[\epsilon\, d(x_t, x_v) + c\epsilon\, \mathcal{W}_d(\mu_t(\cdot|y_t), \mu_v(\cdot|y_v))\right] d\pi^*_{x,y}((x_t, y_t), (x_v, y_v)) + 2kV\delta_{tv} \\
&= k\epsilon\, \mathbb{E}_{\pi^*_{x,y}}\left[\mathcal{C}((x_t, y_t), (x_v, y_v))\right] + 2kV\delta_{tv} \\
&= k\epsilon\, \mathcal{W}_{\mathcal{C}}(\mu_t^{f_t}, \mu_v^{f_v}) + 2kV\delta_{tv},
\end{align*}
where the last step is due to the definition of $\pi^*_{x,y}$. This leads to the final conclusion. $\qed$

Theorem 5 (restated). Let $\mathrm{OT}(\mu_t, \mu_v)$ and $\mathrm{OT}_\varepsilon(\mu_t, \mu_v)$ be the original formulation and the entropy-penalized formulation (as defined in Subsection 2.1) of the OT problem between the empirical measures $\mu_t$ and $\mu_v$ associated with the two datasets $D_t$ and $D_v$, respectively, where $|D_t| = N$ and $|D_v| = M$. Then, for any $i \ne j \ne k \in \{1, 2, \ldots, N\}$ and $o \ne p \ne q \in \{1, 2, \ldots, M\}$, the difference between the calibrated gradients for two datapoints $z_i$ and $z_k$ in $D_t$, and the difference for $z'_p$ and $z'_q$ in $D_v$, can be calculated as
\begin{align*}
\frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_t(z_i)} - \frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_t(z_k)} &= \frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_t(z_i)} - \frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_t(z_k)} - \varepsilon \cdot \frac{N}{N-1} \cdot \left[\frac{1}{(\pi^*_\varepsilon)_{kj}} - \frac{1}{(\pi^*_\varepsilon)_{ij}}\right], \\
\frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_v(z'_p)} - \frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_v(z'_q)} &= \frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_v(z'_p)} - \frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_v(z'_q)} - \varepsilon \cdot \frac{M}{M-1} \cdot \left[\frac{1}{(\pi^*_\varepsilon)_{oq}} - \frac{1}{(\pi^*_\varepsilon)_{op}}\right],
\end{align*}
where $\pi^*_\varepsilon$ is the optimal primal solution to the entropy-penalized OT problem, $z_j$ is any datapoint in $D_t$ other than $z_i$ or $z_k$, and $z'_o$ is any datapoint in $D_v$ other than $z'_p$ or $z'_q$.

Proof.
Let $L(\pi, f, g)$ and $L_\varepsilon(\pi_\varepsilon, f_\varepsilon, g_\varepsilon)$ be the Lagrangian functions of the original formulation and the entropy-penalized formulation between the datasets $D_t$ and $D_v$, respectively, which can be written as
\begin{align*}
L(\pi, f, g) &= \langle \pi, c\rangle + \sum_{i=1}^{N} f_i \cdot \left(\pi'_i \cdot I_M - \mu_t(z_i)\right) + \sum_{j=1}^{M} g_j \cdot \left(I'_N \cdot \pi_j - \mu_v(z_j)\right), \\
L_\varepsilon(\pi_\varepsilon, f_\varepsilon, g_\varepsilon) &= \langle \pi_\varepsilon, c\rangle + \varepsilon \cdot \sum_{i=1}^{N}\sum_{j=1}^{M} \log \frac{(\pi_\varepsilon)_{ij}}{\mu_t(z_i) \cdot \mu_v(z_j)} + \sum_{i=1}^{N} (f_\varepsilon)_i \cdot \left[(\pi_\varepsilon)'_i \cdot I_M - \mu_t(z_i)\right] + \sum_{j=1}^{M} (g_\varepsilon)_j \cdot \left[I'_N \cdot (\pi_\varepsilon)_j - \mu_v(z_j)\right],
\end{align*}
where $c \in \mathbb{R}^{N\times M}$ is the cost matrix consisting of the distances between the $N$ datapoints in $D_t$ and the $M$ datapoints in $D_v$, $I_M = (1, 1, \ldots, 1)^T \in \mathbb{R}^{M\times 1}$ and $I'_N = (1, 1, \ldots, 1) \in \mathbb{R}^{1\times N}$, $\pi$ and $(f, g)$ denote the primal and dual variables, and $\pi'_i$ and $\pi_j$ denote the $i$-th row and $j$-th column of the matrix $\pi$, respectively.

The first-order necessary condition for optima in the Lagrangian Multiplier Theorem gives $\nabla_\pi L(\pi^*, f^*, g^*) = 0$ and $\nabla_\pi L_\varepsilon(\pi^*_\varepsilon, f^*_\varepsilon, g^*_\varepsilon) = 0$, where $\pi^*$ and $(f^*, g^*)$ denote the optimal solutions to the primal and dual problems, respectively. Thus, for any $i \in \{1, 2, \ldots, N\}$ and $j \in \{1, 2, \ldots, M\}$, we have
\begin{align*}
\nabla_\pi L(\pi^*, f^*, g^*)_{ij} &= c_{ij} + f^*_i + g^*_j = 0, \\
\nabla_\pi L_\varepsilon(\pi^*_\varepsilon, f^*_\varepsilon, g^*_\varepsilon)_{ij} &= c_{ij} + \varepsilon \cdot \frac{1}{(\pi^*_\varepsilon)_{ij}} + (f_\varepsilon)^*_i + (g_\varepsilon)^*_j = 0.
\end{align*}
Subtracting, we have
\[
\left[f^*_i - (f_\varepsilon)^*_i\right] + \left[g^*_j - (g_\varepsilon)^*_j\right] - \varepsilon \cdot \frac{1}{(\pi^*_\varepsilon)_{ij}} = 0.
\]
Then, for any $k \ne i \in \{1, 2, \ldots, N\}$, we have
\[
\left[f^*_k - (f_\varepsilon)^*_k\right] + \left[g^*_j - (g_\varepsilon)^*_j\right] - \varepsilon \cdot \frac{1}{(\pi^*_\varepsilon)_{kj}} = 0.
\]
Subtracting and reorganizing, we get
\[
(f_\varepsilon)^*_i - (f_\varepsilon)^*_k = (f^*_i - f^*_k) - \varepsilon \cdot \left[\frac{1}{(\pi^*_\varepsilon)_{ij}} - \frac{1}{(\pi^*_\varepsilon)_{kj}}\right].
\]
From the definition of the calibrated gradients in Eq. 1, we have
\begin{align*}
\frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_t(z_i)} - \frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_t(z_k)} &= \frac{N}{N-1}\left(f^*_i - f^*_k\right), \\
\frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_t(z_i)} - \frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_t(z_k)} &= \frac{N}{N-1}\left[(f_\varepsilon)^*_i - (f_\varepsilon)^*_k\right].
\end{align*}
Finally, subtracting and reorganizing, we have
\[
\frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_t(z_i)} - \frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_t(z_k)} = \frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_t(z_i)} - \frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_t(z_k)} - \varepsilon \cdot \frac{N}{N-1} \cdot \left[\frac{1}{(\pi^*_\varepsilon)_{ij}} - \frac{1}{(\pi^*_\varepsilon)_{kj}}\right].
\]
The proof of the second part of the theorem is similar:
\[
\frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_v(z'_p)} - \frac{\partial\, \mathrm{OT}_\varepsilon(\mu_t, \mu_v)}{\partial \mu_v(z'_q)} = \frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_v(z'_p)} - \frac{\partial\, \mathrm{OT}(\mu_t, \mu_v)}{\partial \mu_v(z'_q)} - \varepsilon \cdot \frac{M}{M-1} \cdot \left[\frac{1}{(\pi^*_\varepsilon)_{op}} - \frac{1}{(\pi^*_\varepsilon)_{oq}}\right].
\]
This completes the proof. $\qed$

B.2.2 FEATURE WEIGHT

Recall that the class-wise Wasserstein distance is defined with respect to the following cost:
\[
\mathcal{C}((x_t, y_t), (x_v, y_v)) = d(x_t, x_v) + c\, \mathcal{W}_d(\mu_t(\cdot|y_t), \mu_v(\cdot|y_v)).
\]
One can change the relative weight between the feature distance $d(x_t, x_v)$ and the label distance $\mathcal{W}_d(\mu_t(\cdot|y_t), \mu_v(\cdot|y_v))$. Here, we show the effect of upweighting the feature distance while keeping the label weight at 1; the results are illustrated in Figure 9 (a). As we move away from uniform weights, the detection rate decreases with larger feature weights. With a feature weight of 100, our method performs similarly to the random detector. Indeed, as we increase the weight on the features, the relative weight on the label distance decreases. As the weight reaches 100, our method behaves like a feature embedder without label information, and hence the mislabeled-detection performance becomes comparable to the random baseline.

B.2.3 LABEL WEIGHT

Next, we shift focus to the label weight. We examine the effect of upweighting the label distance while keeping the feature weight at 1. In Figure 9 (b), as the label weight increases, the detection-rate performance deteriorates. When we increase the label weight, the feature information becomes neglected, which is not as effective as balanced weights between the feature and label distances.
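To make the role of the two weights concrete, here is a small sketch (ours; the helper name and the precomputed matrix `W_label` are hypothetical, not part of the released code) of how they enter the point-to-point transport cost, with `W_label[yt, yv]` standing in for the label distance $\mathcal{W}_d(\mu_t(\cdot|y_t), \mu_v(\cdot|y_v))$:

```python
import numpy as np

def classwise_cost(Xt, yt, Xv, yv, W_label, feat_w=1.0, label_w=1.0):
    """Cost C((xt, yt), (xv, yv)) = feat_w * d(xt, xv) + label_w * W_label[yt, yv],
    where d is the Euclidean distance and W_label is a (hypothetical) precomputed
    matrix of Wasserstein distances between class-conditional feature distributions."""
    D = np.linalg.norm(Xt[:, None, :] - Xv[None, :, :], axis=-1)
    return feat_w * D + label_w * W_label[np.ix_(yt, yv)]

rng = np.random.default_rng(1)
Xt, Xv = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
yt, yv = np.array([0, 1, 0, 1]), np.array([1, 0, 0, 1, 1])
W_label = np.array([[0.0, 2.0], [2.0, 0.0]])  # toy label-to-label distances

C = classwise_cost(Xt, yt, Xv, yv, W_label)
# with label_w = 0 the cost reduces to the pure feature distance
C_feat = classwise_cost(Xt, yt, Xv, yv, W_label, label_w=0.0)
assert np.allclose(C_feat, np.linalg.norm(Xt[:, None] - Xv[None], axis=-1))
# a label mismatch pays the extra label-distance penalty
assert np.isclose(C[0, 0] - C_feat[0, 0], 2.0)
```

Setting `feat_w` to 100 (resp. `label_w` to 100) reproduces the extreme regimes discussed above, where one of the two terms dominates the cost.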

B.2.4 FEATURE EMBEDDER

We use a feature embedder to extract features for the feature-distance part of our method. We train the feature embedder on the accessible validation set until convergence of the training accuracy. Different embedder architectures might be sensitive to different aspects of the input and thus produce different feature outputs. Nevertheless, as we observe in Figure 10, the detection performance associated with different feature-embedder architectures is similar. Hence, in practice, one can flexibly choose the feature embedder to be used in tandem with our method, as long as it has sufficient capacity.

B.3 BALANCING UNBALANCED DATASETS

Although machine learning practitioners might use clean data for training a model, the dataset can often be unbalanced, which can lead to model performance degradation (Thai-Nghe et al., 2009). To recover higher model accuracy, we can rebalance unbalanced datasets by removing the points that cause such disproportion. We showcase how LAVA effectively rebalances the dataset by removing points with poor values and keeping points with the best values. We consider a CIFAR-10 dataset in which the class Frog is unbalanced and contains 5,000 samples while the other classes have only half as many (i.e., 2,500 samples). In Figure 12, we demonstrate the effectiveness of LAVA's valuation, which not only shrinks the dataset by removing poor-value points but also improves model accuracy. At the same time, the other valuation methods were not able to steadily increase the model accuracy and quickly degraded the model performance, which further underscores the effectiveness of our method.
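A minimal sketch of the rebalancing step described above, assuming per-point LAVA values have already been computed (the helper and its interface are ours, not the released code):

```python
import numpy as np
from collections import Counter

def rebalance_by_value(values, labels):
    """Shrink every class to the size of the smallest class by dropping
    its lowest-valued points first; returns the indices of kept points."""
    values, labels = np.asarray(values), np.asarray(labels)
    target = min(Counter(labels.tolist()).values())
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # keep the `target` highest-valued points of this class
        keep.extend(idx[np.argsort(values[idx])[::-1][:target]].tolist())
    return np.sort(np.array(keep))

# toy imbalance: class 0 has 6 points, class 1 has 3
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
values = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.5, 0.6, 0.4]
kept = rebalance_by_value(values, labels)
counts = Counter(np.asarray(labels)[kept].tolist())
assert counts[0] == counts[1] == 3
# the lowest-valued class-0 points (indices 1, 3, and 5) are dropped
assert set(kept[:3]) == {0, 2, 4}
```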

B.4 REDUCING TRAINING SET SIZE

With the growing size of the training dataset, the computation cost and memory overhead naturally increase, which might make it impossible for some practitioners with limited resources to train a model. Therefore, the ability to reduce the training dataset size (Sener & Savarese, 2018) frees up some of the computation burden and allows those with limited resources to carry out model training. Motivated by this challenge, we leverage our data valuation method to significantly decrease the training dataset size while maintaining the model performance. As in the previous section, the idea is to keep a subset of datapoints with the best values and remove poorly valued ones. To demonstrate the effectiveness of LAVA's valuation, we perform this task on a clean CIFAR-10 dataset with 2,500 samples from each class and compare with other data valuation methods. As presented in Figure 13, the performance is well maintained even with smaller subsets of the original dataset. Remarkably, even after reducing the clean training set (25,000 samples) by 15% based on our method's valuation, the performance stays relatively high while outperforming the other valuation baselines.

B.5 DATA SUMMARIZATION

With growing dataset sizes grows the space needed to store the data. Thus, a buyer often wants to shrink the dataset to minimize resources while retaining performance. Unlike reducing the training set size as in Section B.4, in this experiment we select a smaller, representative subset of the whole dataset that can maintain good performance. To measure the performance of each subset, we take the validation performance of the model trained on that subset minus the validation performance of a model trained on a random subset of the same size, an experiment also performed in Kwon & Zou (2021).
In Figure 16, we can observe that our method selects a small subset that performs better than the subsets chosen by the baseline methods most of the time.
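The selection rule underlying both the size-reduction and summarization experiments reduces to keeping the highest-valued fraction of the data; a minimal sketch (ours, with hypothetical `values`):

```python
import numpy as np

def keep_top_fraction(values, frac=0.85):
    """Indices of the `frac` highest-valued points, returned in original order."""
    n_keep = int(round(frac * len(values)))
    return np.sort(np.argsort(values)[::-1][:n_keep])

values = np.array([0.1, 0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 0.0])
kept = keep_top_fraction(values, frac=0.8)
assert len(kept) == 8
# the two lowest-valued points (indices 9 and 0) are removed
assert set(kept) == set(range(10)) - {0, 9}
```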

B.6 SCALABILITY EXPERIMENT

In the main paper, we compared the time complexity of LAVA against other valuation methods. We reported runtime comparisons only for 2,000 test samples, as this is the scale existing methods can handle within a reasonable time (a day); it showcases the computational advantage the proposed approach enjoys over other methods. To further emphasize the computational efficiency of LAVA, we demonstrate it on a larger-scale, higher-dimensional dataset (100,000 samples), ImageNet-100. Additionally, we evaluate the baselines that are able to finish within a day of computation to highlight the advantage of our method, as presented in Table 1. Moreover, we highlight the near-linear time complexity of LAVA on CIFAR-10, which shows the practical computational efficiency of our method, as shown in Figure 17.

B.7 DETECTING OTHER BACKDOOR ATTACKS

Having provided results for the Trojan square attack (Trojan-SQ) (Liu et al., 2017) in Section 4, we now apply LAVA to other backdoor attacks, namely the Hello Kitty blending attack (Blend) (Chen et al., 2017) and the Trojan watermark attack (Trojan-WM) (Liu et al., 2017), and evaluate the efficacy of our method in detecting different types of backdoor attacks. We simulate these attacks by selecting the target class Airplane and poisoning 2,500 (5%) samples of the CIFAR-10 dataset of size 50,000. The backdoor trigger adopted in each attack is portrayed in Figure 8. In Figure 18, we observe that our method achieves superior detection performance on all the attacks considered. The reason is that, despite the difference in trigger pattern, all of these attacks modify both the label and the feature of a poisoned image and thus shift our distributional distance, which is defined over the product space of features and labels.

One concern in a real-world data marketplace is that data is freely replicable.
However, replicates of data introduce no new information, and prior work has therefore argued that a data utility function should be robust to direct data copying (Xu et al., 2021). One advantage of using the class-wise Wasserstein distance to measure data utility is that it is robust to duplication: by its distributional formulation, our method ignores duplicated sets. As shown in Table 3, even after repeating the source set five times, the distance remains the same. Additionally, small noise perturbations of the features barely affect the distance metric. Another concern in a real-world marketplace is that one might find the single datapoint with the highest contribution and duplicate it to maximize profit. However, again owing to the distributional formulation, duplicating a single point many times would increase the distance between the training and validation sets, due to the imbalance in the training distribution caused by copying that point.

B.9 DETAILED EXPERIMENTAL SETTINGS

Datasets and Models. Table 2 summarizes the details of the datasets and models, as well as their licenses, adopted in our experiments.

Hardware. A server with an NVIDIA Tesla P100-PCIE-16GB graphics card is used as the hardware platform in this work.

Software. For our implementation, we use PyTorch as the main framework (Paszke et al., 2019), assisted by three main libraries: otdd (optimal transport calculation setup with datasets) (Alvarez-Melis & Fusi, 2020), geomloss (actual optimal transport calculation) (Feydy et al., 2019), and numpy (tool for array routines) (Harris et al., 2020).

Table 3: Class-wise Wasserstein (OT) distance behavior under direct dataset duplication and near duplication (small feature noise); rows after the first report offsets from the original distance.

Size          Direct    Near
5000          195.64    +0.98
2 × 5000      +0.00     +0.98
3 × 5000      +0.00     +0.98
4 × 5000      +0.00     +0.98
4.5 × 5000    +0.07     +0.98
5 × 5000      +0.00     +0.98
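The duplication invariance in Table 3 follows from the fact that duplicating a dataset leaves its empirical measure unchanged. A one-dimensional sanity check (ours; it uses SciPy's exact 1-D Wasserstein distance rather than the class-wise distance, and the Gaussian toy data is hypothetical):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)   # "training" features
y = rng.normal(0.5, 1.0, size=400)   # "validation" features

d_orig = wasserstein_distance(x, y)
# duplicating the whole set 5x leaves the empirical measure, and hence
# the distance, unchanged
d_dup = wasserstein_distance(np.tile(x, 5), y)
assert np.isclose(d_orig, d_dup, atol=1e-9)
# duplicating a single point many times shifts mass and changes the distance
d_single = wasserstein_distance(np.concatenate([x, [x[0]] * 500]), y)
assert not np.isclose(d_orig, d_single, atol=1e-3)
```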



The gradient difference on the left-hand side of (2) represents the groundtruth value difference between the two training points z_i and z_k, as these values are calculated based on the original OT formulation. In practice, for the sake of efficiency, one only solves the regularized formulation and, therefore, this groundtruth difference cannot be obtained directly. Theorem 2 nevertheless indicates the interesting fact that one can calculate the groundtruth difference based on the solutions to the regularized problem, because every term on the right-hand side depends only on those solutions. In particular, the groundtruth value difference is equal to the value difference produced by the regularized solutions plus calibration terms that scale with ε (Nutz & Wiesel, 2021). This result indicates that while it is not possible to obtain individual groundtruth values by solving the regularized problem, one can exactly recover groundtruth value differences from the regularized solutions. In many applications of data valuation, such as data selection, it is the order of data values that matters (Kwon & Zou, 2021). For instance, to filter out low-quality data, one would first rank the datapoints by their values and then discard the points with the lowest values. In these applications, solving the entropy-regularized program is an ideal choice: it is efficient and recovers the exact ranking of datapoint values. Finally, note that Eq. 3 presents a symmetric result for the calibrated gradients of validation data. In our experiments, we set ε = 0.1, rendering the corresponding calibration terms negligible; as a result, we can directly use the calibrated gradients solved by the regularized program to rank datapoint values.

4 EXPERIMENTS

In this section, we demonstrate the practical efficacy and efficiency of LAVA on various classification datasets.
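To make the preceding discussion of calibrated gradients concrete, here is a minimal NumPy sketch (ours, not the released implementation) that computes entropic-OT dual potentials with log-domain Sinkhorn on hypothetical toy data and forms the calibrated gradients used for ranking; by construction, pairwise differences of the calibrated gradients scale the dual-potential differences by N/(N-1), matching the structure of Theorem 5:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_duals(a, b, C, eps=0.5, iters=2000):
    """Log-domain Sinkhorn for the entropy-regularized OT problem.
    Returns dual potentials (f, g) and the primal plan
    pi_ij = exp((f_i + g_j - C_ij) / eps)."""
    f, g = np.zeros(len(a)), np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(iters):
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    pi = np.exp((f[:, None] + g[None, :] - C) / eps)
    return f, g, pi

def calibrated_gradients(f):
    """Calibrated gradient for each training point: its dual potential
    minus the average potential of the other training points."""
    n = len(f)
    return f - (f.sum() - f) / (n - 1)

rng = np.random.default_rng(0)
Xt, Xv = rng.normal(size=(6, 2)), rng.normal(size=(5, 2))
C = np.linalg.norm(Xt[:, None] - Xv[None, :], axis=-1)
a, b = np.full(6, 1 / 6), np.full(5, 1 / 5)
f, g, pi = sinkhorn_duals(a, b, C)
grads = calibrated_gradients(f)

# the plan (approximately) satisfies the prescribed marginals
assert np.allclose(pi.sum(axis=1), a, atol=1e-6)
assert np.allclose(pi.sum(axis=0), b, atol=1e-6)
# differences of calibrated gradients scale dual differences by N/(N-1)
assert np.isclose(grads[0] - grads[3], 6 / 5 * (f[0] - f[3]))
# rank datapoints: a larger calibrated gradient means that upweighting
# the point increases the OT distance more
ranking = np.argsort(grads)
```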
We compare with nine baselines: (1) Influence functions (INF) (Koh & Liang, 2017), which approximates the LOO error with first-order extrapolation; (2) TracIn-Clean (Pruthi et al., 2020), which accumulates the loss change on validation data during training whenever the training point of interest is sampled; (3) TracIn-Self (Pruthi et al., 2020), which is similar to TracIn-Clean but accumulates the training loss changes; (4) KNN-Shapley (KNN-SV) (Jia et al., 2019a), which



Figure 1: Normalized Wasserstein distance vs. model performance on different datasets and models.

Figure 2: Predicting the change to the OT distance for increasing/reducing the probability mass of a point. The OT distance is calculated between two subsets of CIFAR-10. We examine the change to the OT distance predicted by the calibrated gradients against the actual change. The results of two datapoints are visualized for demonstration (I.(a)/(b) and II.(a)/(b) are analyzed on datapoint #1 and #2, respectively). The probability mass of the datapoint is perturbed from -100% (removing the datapoint) to 100% (duplicating the datapoint). I.(a) and II.(a): Predicted change on the OT distance against the actual change. The predicted change demonstrated high consistency to the actual change despite minor deviation for large perturbation. I.(b) and II.(b): Relative error for the prediction, defined as (predicted_change -actual_change)/actual_change×100%. The color bar represents the theoretical range of perturbation where the change in the OT distance can be accurately predicted. The prediction holds approximately well beyond the range.

Figure 3: Performance comparison between LAVA and baselines on various use cases. For I.(b), we depict the Attack Accuracy, where lower value indicates more effective detection. For II.(b), we depict the Attack Confidence, as lower confidence indicates better poison removal. For III.(b) and IV.(b), we show the model test accuracy, where higher accuracy means effective data removal.

Figure 3 II.(b) shows that by removing data according to the LAVA ranking, the target model's confidence in predicting the target Cat sample as a Frog drops below 40%. Our technique leverages the fact that features from a different class are mixed with the features of the poisoned class, which increases the feature distance between the poisoned and non-poisoned Frog examples.

Noisy Feature Detection. While adding small Gaussian noise to training samples may benefit model robustness (Rusak et al., 2020), strong noise, such as that due to sensor failure, can significantly affect the model performance. We add strong white noise to 25% of the CIFAR-10 dataset without changing any labels. Our method performs extremely well, as shown in Figure 3 III.(a), and detects all 12,500 noisy samples by inspecting fewer than 15,000 samples. This explains the sudden drop of the model's accuracy at the removal budget of 15,000 samples in Figure 3 III.(b): from that point on, the model is only throwing away clean samples. LAVA performs well in this scenario since the strong noise significantly increases the feature distance.

Figure 4: Left: Heatmap of inter-class distance of CIFAR-10. Right: Examples of irrelevant data. The detection rates among the first 500 inspected images are 94% and 46% for Deer-Truck and Cat-Dog, respectively.

Mislabeled Data Detection. Due to the prevalence of human labeling errors (Karimi et al., 2020), it is crucial to detect mislabeled samples. We shuffle the labels of 25% of the samples in the CIFAR-10 dataset to random classes. Unlike backdoor and poisoning attacks, this case is especially hard to detect, since the wrong samples are spread throughout all classes instead of being concentrated in a single target class. Nevertheless, as shown in Figure 3 IV.(a), LAVA's detection rate outperforms the other baselines, and the model performance is maintained even after removing 20k datapoints (Figure 3 IV.(b)).

Irrelevant Data Detection. Datasets collected through web scraping often contain irrelevant samples in given classes (Northcutt et al., 2021; Tsipras et al., 2020); e.g., in a class Glasses, we might have both water glasses and eyeglasses due to a lack of proper inspection or class-meaning specification. This case differs from the mislabeled-data scenario, in which the training features are all relevant to the task. Since irrelevant examples are highly likely to have completely different features than the desired class representation, LAVA is expected to detect them. We design an experiment where we remove all images of one specific class from the classification output but split them equally among the other remaining classes as irrelevant images. As shown in Figure 4, the detection result for a class varies based on the distance between that class and the class from which the irrelevant images are drawn. For instance, when Deer images are placed into the Truck class, we can detect almost 94% of all Deer images within the first 500 removed images. On the other hand, when we place Cat images into the Dog class, our detection rate drops to 45% within the top 500.

Figure 5: Runtime vs. detection error comparison between LAVA and baselines on inspecting 2,000 samples from CIFAR-10 with 10% backdoor data.

Existing data valuation methods include LOO and the influence function (Koh & Liang, 2017), the Shapley value (Jia et al., 2019b; Ghorbani & Zou, 2019; Wang & Jia, 2023), the Banzhaf value (Wang & Jia, 2022), Least Cores (Yan & Procaccia, 2021), Beta Shapley (Kwon & Zou, 2021), and a reinforcement learning-based method (Yoon et al., 2020). However, they all assume knowledge of the underlying learning algorithm and suffer from high computational complexity. The work of Jia et al. (2019a) proposed to use a K-Nearest-Neighbor classifier as a default proxy model for data valuation. While it can be thought of as a learning-agnostic data valuation method, it is not as effective and efficient as our method in distinguishing data quality. Xu et al. (2021) propose to use the volume to measure the utility of a dataset. Volume is agnostic to learning algorithms and easy to calculate, because it is defined simply as the square root of the trace of the feature-matrix inner product. However, the sole dependence on features makes it incapable of detecting bad data caused by labeling errors. Moreover, to evaluate the contribution of individual points, the authors resort to the Shapley value, which would still be expensive for large datasets.

Figure 9: (a) Comparison between different feature weights on the performance of mislabeled-data detection in CIFAR-10. (b) Comparison between different label weights on the performance of mislabeled-data detection in CIFAR-10. (c) Comparison between different validation sizes on inspecting 50k samples from CIFAR-10 with 25% mislabeled data.

Additionally, when restricting LAVA and the other baselines to a validation set of 500 samples, our method beats the best baseline at detecting mislabeled data among the 50k CIFAR-10 samples with 25% mislabeled, as shown in Figure 11.

Figure 11: Detection rate by various methods on mislabeled CIFAR-10 using validation size of 500.

Figure 14: Detection performance comparison between LAVA and the model trained on validation data of size 500 on various use cases in CIFAR-10.

Figure 13: Comparison of various methods on reducing a dataset size based on valuation of datapoints on CIFAR-10.

Figure 16: Comparison of various methods on data summarization based on valuation of datapoints on CIFAR-10.

Figure 17: Near-linear time complexity of LAVA shown on CIFAR-10.

Figure 18: Detection rate of various backdoor attacks by LAVA.

B.8 IMPLICATIONS OF THE PROPOSED DATA VALUATION METHOD FOR REAL-WORLD DATA MARKETPLACES

Table 1: Comparison of the runtime various methods need to valuate ImageNet-100.

Table 2: Summary of the datasets and models, and their licenses, used for experimental evaluation. (Note: the "Detail" column implicitly refers to training data unless explicitly noted.)

7. ACKNOWLEDGEMENTS

RJ and the ReDS Lab gratefully acknowledge the support from the Cisco Research Award, the Virginia Tech COE Fellowship, and the NSF CAREER Award. Jiachen T. Wang is supported by Princeton's Gordon Y. S. Wu Fellowship. YZ is supported by the Amazon Fellowship.

Availability: https://github.com/ruoxi-jia-group/LAVA.

APPENDIX A RESTATEMENT OF THEOREMS AND FULL PROOFS

In this section, we will restate our main results and give full proofs.

A.1 SUMMARY OF NOTATIONS

Let $\mu_t$, $\mu_v$ be the training distribution and validation distribution, respectively. We denote by $f_t: \mathcal{X} \to \{0, 1\}^V$ and $f_v: \mathcal{X} \to \{0, 1\}^V$ the labeling functions for training and validation data, where $V$ is the number of different labels. We can then denote the joint distributions of the random data-label pairs $(x, f_t(x))_{x\sim\mu_t(x)}$ and $(x, f_v(x))_{x\sim\mu_v(x)}$ by $\mu_t^{f_t}$ and $\mu_v^{f_v}$, respectively; these are the same measures as $\mu_t$ and $\mu_v$ but written with explicit dependence on $f_t$ and $f_v$ for clarity. The distributions of $(f_t(x))_{x\sim\mu_t(x)}$ and $(f_v(x))_{x\sim\mu_v(x)}$ are denoted $\mu^{f_t}$ and $\mu^{f_v}$, respectively. Besides, we define the conditional distributions
\[
\mu_t(x|y) := \frac{\mu_t(x)\,\mathbb{I}[f_t(x)=y]}{\int \mu_t(x)\,\mathbb{I}[f_t(x)=y]\, dx},
\qquad
\mu_v(x|y) := \frac{\mu_v(x)\,\mathbb{I}[f_v(x)=y]}{\int \mu_v(x)\,\mathbb{I}[f_v(x)=y]\, dx}.
\]
Let $f: \mathcal{X} \to [0, 1]^V$ be the model trained on the training data and $\mathcal{L}: \{0, 1\}^V \times [0, 1]^V \to \mathbb{R}_+$ be the loss function. We denote by $\pi \in \Pi(\mu_1, \mu_2)$ a coupling between a pair of distributions $\mu_1, \mu_2$ and by $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a distance metric. The 1-Wasserstein distance with respect to the distance function $d$ between two distributions $\mu_1, \mu_2$ is defined as
\[
\mathcal{W}_d(\mu_1, \mu_2) := \min_{\pi \in \Pi(\mu_1, \mu_2)} \int_{\mathcal{X}\times\mathcal{X}} d(x_1, x_2)\, d\pi(x_1, x_2).
\]
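For one-dimensional discrete measures, this distance can be evaluated exactly via empirical CDFs; a small illustration (ours) using SciPy:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# two empirical measures: uniform on {0, 1} vs. uniform on {0.5, 1.5}
mu1 = np.array([0.0, 1.0])
mu2 = np.array([0.5, 1.5])
# the optimal coupling matches 0 -> 0.5 and 1 -> 1.5, each moving mass
# 1/2 a distance of 1/2, so W_d = 0.5 under d(x, x') = |x - x'|
assert np.isclose(wasserstein_distance(mu1, mu2), 0.5)
```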

A.2 STATEMENT OF ASSUMPTIONS

To prove Theorem 1, we need the concept of probabilistic cross-Lipschitzness, which requires that two labeling functions produce consistent labels with high probability on two close instances.

Definition 3 (Probabilistic Cross-Lipschitzness). Two labeling functions $f_t: \mathcal{X} \to \{0, 1\}^V$ and $f_v: \mathcal{X} \to \{0, 1\}^V$ are $(\epsilon, \delta)$-probabilistically cross-Lipschitz w.r.t. a joint distribution $\pi$ over $\mathcal{X} \times \mathcal{X}$ if for all $\epsilon > 0$:
\[
\mathbb{P}_{(x_1, x_2)\sim\pi}\left[\|f_t(x_1) - f_v(x_2)\| > \epsilon\, d(x_1, x_2)\right] \le \delta.
\]

B.1 EFFECTIVENESS ON DIVERSE DATASETS

In the main text, we focused our evaluation on CIFAR-10. Here, we provide experiments showing the effectiveness of LAVA at detecting bad data on diverse datasets.

Backdoor Attack Detection. We evaluate another type of backdoor attack (Section 4), the Hello Kitty blending attack (Blend) (Chen et al., 2017), which mixes the target-class sample with the Hello Kitty image, as illustrated in Figure 8 (B). We attack the German Traffic Sign dataset (GTSRB) on target class 6 by poisoning 1,764 (5%) samples of the whole dataset. Our method achieves the highest detection rate, as shown in Figure 6 (a). In particular, the 5,000 points with the lowest LAVA data values contain all poisoned data, while the second-best method on this task, KNN-SV, needs around 11,000 samples to cover all poisoned examples. Our algorithm performs especially well on this attack, since the label of poisoned data is changed to the target class and the patching trigger is large; both the label and feature changes contribute to the increase of the OT distance and thus ease detection.

Noisy Feature Detection. Here, we apply LAVA to the MNIST dataset where 25% of the whole dataset is contaminated by feature noise. Our method still outperforms all the baselines, detecting all noisy data within the first 14,000 samples, which is 5,000 fewer than the best baseline requires, as shown in Figure 6 (b).

Irrelevant Data.
We perform another irrelevant data detection experiment, focusing on the CIFAR-100 dataset. In Figure 7, we illustrate some of the irrelevant samples detected by LAVA. Intuitively, irrelevant data in a class should be easily detected by LAVA, since such images are far from representative of the class, and increasing the probability mass associated with them leads to a larger distributional distance to the clean validation data.

B.2 ABLATION STUDY

We perform an ablation study on the validation size and on the hyperparameters of our method, providing insights into the impact of these settings. We use the mislabeled-detection use case and the CIFAR-10 dataset as the example setting for the ablation study.

B.2.1 VALIDATION SIZE

For all the experiments in the main text, we use a validation set of size 10,000. Naturally, we want to examine the effect of the validation set size on the detection rate of mislabeled data. In Figure 9 (c), we illustrate the detection-rate performance with smaller validation sizes: 200, 500, 2,000, and 5,000. We observe that even halving the validation set to 5,000 largely maintains the detection-rate performance, while small validation sets (200, 500, 2,000) degrade the detection rate by more than 50%. Despite this degradation, our detection performance with these small validation sizes is in fact comparable with the baselines in Figure 3 IV.(a) that leverage the full validation size of 10,000.

