ESTIMATING INFORMATIVENESS OF SAMPLES WITH SMOOTH UNIQUE INFORMATION

Abstract

We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples. Our work generalizes existing frameworks but enjoys better computational properties for heavily overparametrized models, which makes it possible to apply it to real-world networks.

1. INTRODUCTION

Training a deep neural network (DNN) entails extracting information from samples in a dataset and storing it in the weights of the network, so that it may be used in future inference or prediction. But how much information does a particular sample contribute to the trained model? The answer can be used to provide strong generalization bounds (if no information is used, the network is not memorizing the sample), privacy bounds (how much information the network can leak about a particular sample), and better interpretation of the training process and its outcome.

To determine the information content of samples, we first need to define and compute information. In the classical sense, information is a property of random variables, which may be degenerate for the deterministic process of computing the output of a trained DNN in response to a given input (inference). So even posing the problem presents some technical challenges. But beyond technicalities, how can we know whether a given sample is memorized by the network and, if it is, whether it is used for inference?

We propose a notion of unique sample information that, while rooted in information theory, captures some aspects of stability theory and influence functions. Unlike most information-theoretic measures, ours can be approximated efficiently for large networks, especially in the case of transfer learning, which encompasses many real-world applications of deep learning. Our definition can be applied in either "weight space" or "function space." This allows us to study the non-trivial difference between the information the weights possess (weight space) and the information the network actually uses to make predictions on new samples (function space). Our method yields a valid notion of information without relying on the randomness of the training algorithm (e.g., stochastic gradient descent, SGD), and works even for deterministic training algorithms.
Our main workhorse is a first-order approximation of the network. This approximation is accurate when the network is pre-trained (Mu et al., 2020), as is common in practical applications, or is randomly initialized but very wide (Lee et al., 2019), and it can be used to obtain a closed-form expression for the per-sample information. In addition, our method scales better with the number of parameters than most other information measures, which makes it applicable to massively over-parametrized models such as DNNs. Our information measure can be computed without actually training the network, making it amenable to problems like dataset summarization. We apply our method to remove a large portion of uninformative examples from a training set with minimal impact on the accuracy of the resulting model (dataset summarization). We also apply our method to detect mislabeled samples, which we show carry more unique information.

To summarize, our contributions are: (1) we introduce a notion of unique information that a sample contributes to the training of a DNN, both in weight space and in function space, and relate it to the stability of the training algorithm; (2) we provide an efficient method to compute unique information even for large networks, using a linear approximation of the DNN and without having to train the network; (3) we show applications to dataset summarization and analysis. The implementation of the proposed method and the code for reproducing the experiments are available at https://github.com/awslabs/aws-cv-unique-information.

Prerequisites and Notation. Consider a dataset of n labeled examples S = {z_i}_{i=1}^n, where z_i = (x_i, y_i), x_i ∈ X and y_i ∈ R^k, and a neural network model f_w : X → R^k with parameters w ∈ R^d. Throughout the paper, S_{−i} = {z_1, …, z_{i−1}, z_{i+1}, …, z_n} denotes the dataset excluding the i-th sample; f_{w_t} is often shortened to f_t; the concatenation of all training inputs is denoted by X, the concatenation of all training labels by Y ∈ R^{nk}, and the concatenation of all outputs by f_w(X) ∈ R^{nk}. The loss on the i-th example is denoted by L_i(w) and, unless specified otherwise, equals ½‖f_w(x_i) − y_i‖²₂. This choice is useful when dealing with linearized models and is justified by Hui & Belkin (2020), who showed that the mean-squared error (MSE) loss is as effective as cross-entropy for classification tasks. The total loss is L(w) = Σ_{i=1}^n L_i(w) + (λ/2)‖w − w_0‖²₂, where λ ≥ 0 is a weight-decay regularization coefficient and w_0 is the weight initialization point. Note that the regularization term differs from standard weight decay ‖w‖²₂ and is more appropriate for linearized neural networks, as it allows us to derive the dynamics analytically (see Sec. F of the appendix). Finally, a (possibly stochastic) training algorithm is denoted by a mapping A : S → W that maps a training dataset S to classifier weights W = A(S). Since the training algorithm can be stochastic, W is a random variable; the distribution of possible weights W after training with algorithm A on dataset S is denoted by p_A(w | S). We use several information-theoretic quantities, such as the entropy H(X) = −E[log p(x)], the mutual information I(X; Y) = H(X) + H(Y) − H(X, Y), the Kullback–Leibler divergence KL(p(x) ‖ q(x)) = E_{x∼p(x)}[log(p(x)/q(x))], and their conditional variants (Cover & Thomas, 2006). If y ∈ R^m and x ∈ R^n, then the Jacobian ∂y/∂x is an m × n matrix, and the gradient ∇_x y denotes the transpose of the Jacobian ∂y/∂x, an n × m matrix.
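To illustrate the first-order (linearized) approximation that the method builds on, the following is a minimal sketch with numpy. The toy two-layer network, its dimensions, the finite-difference gradient, and the perturbation scale are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network f_w(x) with scalar output.
d_in, d_hid = 5, 32
W1_0 = rng.normal(0, 1 / np.sqrt(d_in), (d_hid, d_in))
W2_0 = rng.normal(0, 1 / np.sqrt(d_hid), (1, d_hid))

def f(w, x):
    """Forward pass; w is the flattened parameter vector."""
    W1 = w[: d_hid * d_in].reshape(d_hid, d_in)
    W2 = w[d_hid * d_in:].reshape(1, d_hid)
    return (W2 @ np.tanh(W1 @ x)).item()

def grad_f(w, x, eps=1e-6):
    """Numerical gradient of f with respect to w (central differences)."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e, x) - f(w - e, x)) / (2 * eps)
    return g

w0 = np.concatenate([W1_0.ravel(), W2_0.ravel()])
x = rng.normal(size=d_in)

# Linearization around w0: f_w(x) ≈ f_{w0}(x) + ∇_w f_{w0}(x)^T (w − w0).
g0 = grad_f(w0, x)
w = w0 + 0.01 * rng.normal(size=w0.size)  # small step away from w0
f_lin = f(w0, x) + g0 @ (w - w0)
print(abs(f(w, x) - f_lin))  # approximation error, small for w near w0
```

For a pre-trained or very wide network the weights stay close to w0 during fine-tuning, which is why this linear model can track the true training dynamics and admits the closed-form expressions used later in the paper.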

2. RELATED WORK

Our work is related to information-theoretic stability notions (Bassily et al., 2016; Raginsky et al., 2016; Feldman & Steinke, 2018) that seek to measure the influence of a sample on the output and to bound generalization. Raginsky et al. (2016) define information stability as E_S[(1/n) Σ_{i=1}^n I(W; Z_i | S_{−i})], the expected average amount of unique (Shannon) information that the weights have about an example. This quantity, without the expectation over S, is also our starting point (eq. 1). Bassily et al. (2016) define KL-stability as sup_{S,S′} KL(p_A(w | S) ‖ p_A(w | S′)), where S and S′ are datasets that differ by one example, while Feldman & Steinke (2018) define average leave-one-out KL-stability as sup_S (1/n) Σ_{i=1}^n KL(p_A(w | S) ‖ p_A(w | S_{−i})). The latter closely resembles our definition (eq. 4). Unfortunately, while the weights are continuous, the optimization algorithm (such as SGD) is usually discrete, which generally makes the resulting quantities degenerate (infinite). Most works address this issue by replacing the discrete optimization algorithm with a continuous one, such as stochastic gradient Langevin dynamics (Welling & Teh, 2011) or continuous stochastic differential equations that approximate SGD in the limit (Li et al., 2017). We aim to avoid such assumptions and give a definition that is directly applicable to real networks trained with standard algorithms. To do this, we apply a smoothing procedure to a standard discrete algorithm. The final result can still be interpreted as a valid bound on Shannon mutual information, but for a slightly modified optimization algorithm. Our definitions relate the informativeness of a sample to the notion of algorithmic stability (Bousquet & Elisseeff, 2002; Hardt et al., 2015), where a training algorithm A is called stable if A(S) is close to A(S′) when the datasets S and S′ differ by only one sample.
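To make the leave-one-out KL-stability notion concrete, here is a minimal sketch under assumptions of our own: the "training algorithm" is deterministic ridge regression, and its output is smoothed with isotropic Gaussian noise N(0, σ²I), in which case the KL divergence between the two weight distributions has the closed form ‖w_S − w_{S−i}‖² / (2σ²):

```python
import numpy as np

rng = np.random.default_rng(1)

def train(X, Y, lam=1.0):
    """Deterministic 'algorithm' A: ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def loo_kl_stability(X, Y, sigma=0.1, lam=1.0):
    """Per-sample leave-one-out KL stability when A(S) is smoothed with
    N(0, sigma^2 I) noise:
        KL(N(w_S, σ²I) || N(w_{S−i}, σ²I)) = ||w_S − w_{S−i}||² / (2σ²)."""
    n = X.shape[0]
    w_full = train(X, Y, lam)
    kls = []
    for i in range(n):
        mask = np.arange(n) != i
        w_loo = train(X[mask], Y[mask], lam)
        kls.append(np.sum((w_full - w_loo) ** 2) / (2 * sigma**2))
    return np.array(kls)

n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = X @ w_true + 0.1 * rng.normal(size=n)
kls = loo_kl_stability(X, Y)
print(kls.mean())  # average per-sample leave-one-out KL (in nats)
```

Samples whose removal moves the trained weights the most receive the largest KL values, which is exactly the intuition behind treating leave-one-out stability as a proxy for the unique information a sample contributes.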

