UNDERSTANDING INFLUENCE FUNCTIONS AND DATA-MODELS VIA HARMONIC ANALYSIS

Abstract

Influence functions estimate the effect of individual training data points on the model's predictions on test data and were adapted to deep learning in Koh & Liang (2017). They have been used for detecting data poisoning, detecting helpful and harmful examples, estimating the influence of groups of datapoints, etc. Recently, Ilyas et al. (2022) introduced a linear regression method, termed datamodels, to predict the effect of training points on outputs on test data. The current paper seeks to provide a better theoretical understanding of such interesting empirical phenomena. The primary tools are harmonic analysis and the idea of noise stability. Contributions include: (a) an exact characterization of the learnt datamodel in terms of Fourier coefficients; (b) an efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel; (c) new insights into when influences of groups of datapoints may or may not add up linearly.

1. INTRODUCTION

It is often of great interest to quantify how the presence or absence of a particular training data point affects the trained model's performance on test data points. Influence functions are a classical idea for this (Jaeckel, 1972; Hampel, 1974; Cook, 1977) that has recently been adapted to modern deep models and large datasets (Koh & Liang, 2017). Influence functions have been applied to explain predictions and produce confidence intervals (Schulam & Saria, 2019), investigate model bias (Brunet et al., 2019; Wang et al., 2019), estimate Shapley values (Jia et al., 2019; Ghorbani & Zou, 2019), improve human trust (Zhou et al., 2019), and craft data poisoning attacks (Koh et al., 2019).

Influence has several formalizations. The classic calculus-based estimate (henceforth referred to as continuous influence) conceptualizes the training loss as a weighted sum over training datapoints, where the weight of a particular datapoint z can be varied infinitesimally. Using the gradient and Hessian, one obtains an expression for the rate of change of the test error (or another function) at a test point z′ with respect to (infinitesimal) changes to the weight of z. Though the estimate is derived only for infinitesimal changes to the weight of z in the training set, in practice it has also been employed as a reasonable estimate for the discrete notion of influence, which is the effect of completely adding or removing the data point from the training dataset (Koh & Liang, 2017). Informally speaking, this discrete influence is defined as f(S ∪ {i}) − f(S), where f is some function of the test points, S is a training dataset, and i is the index of a training point. (This can be noisy, so several papers use the expected influence of i by taking the expectation over a random choice of S of a certain size; see Section 2.) Koh & Liang (2017), as well as subsequent papers, have used continuous influence to estimate the effect of decidedly non-infinitesimal changes to the dataset, such as adding or deleting entire groups of datapoints (Koh et al., 2019). Recently, Bae et al. (2022) gave mathematical reasons why this is not well founded, along with a clearer explanation (and alternative implementation) of Koh-Liang style estimators.

Yet another idea related to influence functions is the linear datamodels of Ilyas et al. (2022). By training many models on subsets containing a p fraction of the datapoints in the training set, the authors show that some interesting measures of test error (defined using logit values) behave as follows: the measure f(x) is well approximated by a (sparse) linear expression θ_0 + Σ_i θ_i x_i, where x is a vector denoting a sample of a p fraction of training datapoints, with x_i = 1 indicating presence of the i-th training point and x_i = −1 denoting absence. The coefficients θ_i are estimated via lasso regression. The surprise here is that f(x), the result of deep learning on dataset x, is well approximated by θ_0 + Σ_i θ_i x_i. The authors note that the θ_i's can be viewed as heuristic estimates for the discrete influence of the i-th datapoint.

The current paper seeks to provide a better theoretical understanding of the above-mentioned phenomena concerning discrete influence functions. At first sight this quest appears difficult.
The calculus definition of influence functions (which, as mentioned, is also used in practice to estimate the discrete notions of influence) involves Hessians and gradients evaluated on the trained net, and thus one imagines that any explanation for properties of influence functions must await better mathematical understanding of datasets, net architectures, and training algorithms. Surprisingly, we show that the explanation for many observed properties turns out to be fairly generic. Our chief technical tool is harmonic analysis, and especially the theory of noise stability of functions (see O'Donnell (2014) for an excellent survey).

1.1. OUR CONCEPTUAL FRAMEWORK (DISCRETE INFLUENCE)

Training data points are numbered 1 through N, but the model is trained on a random subset of data points, where each data point is included independently in the subset with probability p. (This is precisely the setting in linear datamodels.) For notational ease and consistency with harmonic analysis, we denote this subset by x ∈ {−1, +1}^N, where +1 means the corresponding data point was included. We are interested in some quantity f(x) associated with the trained model on one or more test data points. Note that f is a probabilistic function of x due to stochasticity in deep net training (SGD, dropout, data augmentation, etc.), but one can average over the stochastic choices and think of f as a deterministic function f : {±1}^N → R. (In practice, this means we estimate f(x) by repeating the training on x, say, 10 to 50 times.) This scenario is close to the classical study of boolean functions via harmonic analysis, except that our function is real-valued. Using those tools we provide the following new mathematical understanding:

1. We give reasons for the existence of the datamodels of Ilyas et al. (2022), i.e. the phenomenon that functions related to test error are well approximated by a linear function θ_0 + Σ_i θ_i x_i. See Section 3.1.
2. Section 2 gives exact characterizations of the θ_i's for datamodels with and without regularization. (Earlier, Ilyas et al. (2022) noted this for the special case p = 0.5 with ℓ2 regularization.)
3. Using our framework, we give a new algorithm to estimate the degree to which a test function f is well approximated by a linear datamodel, without having to train the datamodel itself. See Section 3.2; our method needs only O(1/ϵ³) samples instead of O(N/ϵ²).
4. We study group influence, which quantifies the effect of adding or deleting a set I of datapoints to x. Ilyas et al. (2022) note that this can often be well approximated by linearly adding the individual influences of the points in I. Section 4 clarifies simple settings where linearity fails, by a factor exponentially large in |I|, and also discusses potential reasons for the observed linearity.

1.2. OTHER RELATED WORK

Narasimhan et al. (2015) investigate when influence is PAC-learnable. Basu et al. (2020) use second-order influence functions and find they make better predictions than first-order influence functions. Other instance-based interpretability techniques include Representer Point Selection (Yeh et al., 2018), Grad-Cos (Charpiat et al., 2019), Grad-Dot (Hanawa et al., 2020), MMD-Critic (Kim et al., 2016), and unconditional counterfactual explanations (Wachter et al., 2017).
Variants on influence functions have also been proposed, including those using Fisher kernels (Khanna et al., 2019), tricks for faster and more scalable inference (Guo et al., 2021; Schioppa et al., 2022), and identifying relevant training samples with relative influence (Barshan et al., 2020). Cohen et al. (2020) use influence functions to detect adversarial examples. Kong et al. (2021) propose an influence-based relabeling function that can relabel harmful examples to improve generalization instead of just discarding them. Zhang & Zhang (2022) use Neural Tangent Kernels to understand influence functions rigorously for highly overparametrized nets. Pruthi et al. (2020) give another notion of influence by tracing the effect of data points on the loss throughout gradient descent. Chen et al. (2020) define multi-stage influence functions to trace influence all the way back to pre-training, to find which samples were most helpful during pre-training. Discrete influence played a prominent role in the surprising discovery of the long-tail phenomenon in Feldman (2020); Feldman & Zhang (2020): the experimental finding that in large datasets like ImageNet, a significant fraction of training points are atypical, in the sense that the model does not easily learn to classify them correctly if the point is removed from the training set.

2. HARMONIC ANALYSIS, INFLUENCE FUNCTIONS AND DATAMODELS

In this section we introduce notation for standard harmonic analysis of functions on the hypercube (O'Donnell, 2014), and establish connections between the corresponding Fourier coefficients, the discrete influence of data points, and the linear datamodels of Ilyas et al. (2022).

2.1. PRELIMINARIES: HARMONIC ANALYSIS

In the conceptual framework of Section 1.1, let [N] := {1, 2, ..., N}. Viewing f : {±1}^N → R as a vector in R^{2^N}, for any distribution D on {±1}^N the set of all such functions can be treated as a vector space with inner product ⟨f, g⟩_D = E_{x∼D}[f(x)g(x)], leading to the norm ∥f∥_D = √(E_{x∼D}[f(x)²]). Harmonic analysis involves identifying special orthonormal bases for this vector space. We are interested in f's values at or near p-biased points x ∈ {±1}^N, where x is viewed as a random variable each of whose coordinates is independently set to +1 with probability p. We denote this distribution by B_p. Properties of f in this setting are best studied using the orthonormal basis functions {ϕ_S : S ⊆ [N]} defined as ϕ_S(x) = ∏_{i∈S} (x_i − µ)/σ, where µ = 2p − 1 and σ² = 4p(1 − p) are the mean and variance of each coordinate of x. Orthonormality implies that E_x[ϕ_S(x)] = 0 when S ≠ ∅ and ⟨ϕ_S, ϕ_{S′}⟩_{B_p} = 1_{S=S′}. Every f : {±1}^N → R can then be expressed as f = Σ_{S⊆[N]} f_S ϕ_S. Our discussion will often refer to the f_S's as the "Fourier" coefficients of f when the orthonormal basis is clear from context. Orthonormality also implies Parseval's identity: Σ_S f_S² = ∥f∥²_{B_p}. For any vector z ∈ R^d, z_i denotes its i-th coordinate (with 1-indexing). For a matrix A ∈ R^{d×r}, A_{i,:} ∈ R^r and A_{:,j} ∈ R^d denote its i-th row and j-th column respectively. We use ∥·∥ to denote the Euclidean norm when not otherwise specified.
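To make the basis concrete, here is a minimal Python sketch (not from the paper) that builds the p-biased basis functions ϕ_S for a small N by enumerating the hypercube, computes the Fourier coefficients of a toy stand-in function, and checks orthonormality and Parseval's identity numerically.

```python
import itertools
import numpy as np

# Minimal sketch (not from the paper): the p-biased orthonormal basis phi_S and the Fourier
# coefficients f_S, computed exactly for a small N by enumerating {-1,+1}^N. The test
# function f below is an arbitrary stand-in.
N, p = 4, 0.6
mu, sigma = 2 * p - 1, np.sqrt(4 * p * (1 - p))

cube = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)
weights = np.prod(np.where(cube == 1, p, 1 - p), axis=1)        # B_p probability of each point

def phi(S):
    """phi_S(x) = prod_{i in S} (x_i - mu)/sigma, evaluated on every point of the cube."""
    if len(S) == 0:
        return np.ones(len(cube))
    return np.prod((cube[:, list(S)] - mu) / sigma, axis=1)

f_vals = np.tanh(cube @ np.array([0.5, -0.2, 0.1, 0.3]))

subsets = [S for r in range(N + 1) for S in itertools.combinations(range(N), r)]
coeffs = {S: np.sum(weights * f_vals * phi(S)) for S in subsets}   # f_S = <f, phi_S>_{B_p}

# Orthonormality: <phi_S, phi_T>_{B_p} = 1 iff S == T.
assert np.isclose(np.sum(weights * phi((0, 1)) * phi((0, 1))), 1.0)
assert np.isclose(np.sum(weights * phi((0, 1)) * phi((2,))), 0.0)
# Parseval: sum_S f_S^2 = E_{B_p}[f(x)^2].
assert np.isclose(sum(c ** 2 for c in coeffs.values()), np.sum(weights * f_vals ** 2))
```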

2.2. INFLUENCE FUNCTIONS

We use the notion of influence of a single point from Feldman & Zhang (2020); Ilyas et al. (2022). The influence of the i-th coordinate on f at x is defined as Inf_i(f(x)) = f(x|_{i→1}) − f(x|_{i→−1}), where x is sampled as a p-biased training set and x|_{i→1} is x with the i-th coordinate set to 1.

Proposition 2.1 (Individual influence). The "leave-one-out" influences satisfy

Inf_i(f(x)) = (2/σ) Σ_{S∋i} f_S ϕ_{S\{i}}(x),    Inf_i(f) := E_x[Inf_i(f(x))] = (2/σ) f_{i}.    (1)

Thus the degree-1 Fourier coefficients are directly related to the average influences of individual points. Similar results hold for other definitions of single-point influence: E_x[f(x) − f(x|_{i→−1})] and E_x[f(x|_{i→1}) − f(x)] are equal to p·Inf_i(f) and (1 − p)·Inf_i(f) respectively. The proof follows by observing that

f(x|_{i→1}) − f(x|_{i→−1}) = Σ_{S∋i} f_S ((1 − µ)/σ − (−1 − µ)/σ) ϕ_{S\{i}}(x) = (2/σ) Σ_{S∋i} f_S ϕ_{S\{i}}(x).

The only term that is not zero in expectation is the one for S = {i}, which proves the result. Section 4 deals with the influence of adding or deleting larger subsets of points.

Continuous vs. discrete influence. Koh & Liang (2017) utilize a continuous notion of influence: train a model using dataset x ∈ {±1}^N, treat the i-th coordinate of x as a continuous variable x_i, and compute df/dx_i at x_i = 1 using gradients and Hessians of the loss at the end of training. This is called the continuous influence of the i-th datapoint on f. As mentioned in Section 1, in several other contexts one uses the discrete influence (1), which has a better connection to harmonic analysis. While experiments in Koh & Liang (2017) suggest that continuous influence closely tracks the discrete influence in linear models, Bae et al. (2022) show that this breaks down in deep learning settings. For the rest of the paper we will use discrete influences.
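As a quick numerical check of Proposition 2.1, the following sketch (again with a toy stand-in for f, not the paper's setup) verifies Inf_i(f) = (2/σ) f_{i} by exact enumeration.

```python
import itertools
import numpy as np

# Sketch (toy f, not the paper's setup): check Proposition 2.1, Inf_i(f) = (2/sigma) * f_{i},
# by exact enumeration over the p-biased cube for a small N.
N, p, i = 4, 0.6, 2
mu, sigma = 2 * p - 1, np.sqrt(4 * p * (1 - p))

cube = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)
w = np.prod(np.where(cube == 1, p, 1 - p), axis=1)              # B_p probabilities

def f(X):
    # stand-in for "train on the subset encoded by X and evaluate a test statistic"
    return np.tanh(X @ np.array([0.5, -0.2, 0.1, 0.3]))

x_hi, x_lo = cube.copy(), cube.copy()
x_hi[:, i], x_lo[:, i] = 1.0, -1.0
inf_i = np.sum(w * (f(x_hi) - f(x_lo)))                          # E_x[f(x|i->1) - f(x|i->-1)]

f_coeff_i = np.sum(w * f(cube) * (cube[:, i] - mu) / sigma)      # degree-1 Fourier coefficient f_{i}
assert np.isclose(inf_i, 2 / sigma * f_coeff_i)
```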

2.3. LINEAR DATAMODELS FROM A HARMONIC ANALYSIS LENS

Next we turn to the phenomenon observed by Ilyas et al. (2022): a function f(x) related to average test error[1] (where x is the training set) often turns out to be approximable by a linear function θ_0 + Σ_i θ_i x̄_i, where x̄ ∈ {0,1}^N is the binary version of x ∈ {±1}^N. It is important to note that this approximation (when it exists) holds only in a least-squares sense, meaning that E_x[(f(x) − θ_0 − Σ_i θ_i x̄_i)²] is small, where the expectation is over p-biased x. The authors suggest that θ_i can be seen as an estimate of the average discrete influence of variable i. While this is intuitive, they do not give a general proof (their Lemma 2 proves it for p = 1/2 with ℓ2 regularization). The following result exactly characterizes the solutions for arbitrary p and with both ℓ1 and ℓ2 regularization.

Theorem 2.2 (Characterizing solutions to linear datamodels). Denote the quality of a linear datamodel θ ∈ R^{N+1} on the p-biased distribution over training sets B_p by

R(θ) := E_{x∼B_p}[(f(x) − θ_0 − Σ_{i=1}^N θ_i x̄_i)²],    (2)

where x̄ ∈ {0,1}^N is the binary version of x ∈ {±1}^N. Then, for µ = 2p − 1 and σ = 2√(p(1 − p)), the following are true about the optimal datamodels with and without regularization:

(a) The unregularized minimizer θ* = arg min_θ R(θ) satisfies

θ*_i = (2/σ) f_{i},    θ*_0 = f_∅ − ((µ + 1)/2) Σ_i θ*_i.    (3)

Furthermore, the residual error is the sum of squares of all Fourier coefficients of order 2 or higher:

R(θ*) = B_{≥2} := Σ_{S⊆[N]: |S|≥2} f_S².    (4)

(b) The minimizer with ℓ2 regularization, θ*(λ, ℓ2) = arg min_θ {R(θ) + λ∥θ_{1:N}∥²_2}, satisfies

θ*(λ, ℓ2)_i = (2/σ) (1 + 4λ/σ²)^{−1} f_{i}.    (5)

(c) The minimizer with ℓ1 regularization, θ*(λ, ℓ1) = arg min_θ {R(θ) + λ∥θ_{1:N}∥_1}, satisfies

θ*(λ, ℓ1)_i = (2/σ) [(f_{i} − λ/σ)_+ − (−f_{i} − λ/σ)_+] = (2/σ) sign(f_{i}) (|f_{i}| − λ/σ)_+,    (6)

where (z)_+ = z·1_{z>0} is the standard ReLU operation.

This result shows that the optimal linear datamodel under the various regularization schemes, for any p ∈ (0, 1), is directly related to the first-order Fourier coefficients at p. Since, from Equation (1), the average discrete influences are also given by the first-order coefficients, this directly establishes a connection between datamodels and influences. Result (c) shows that ℓ1 regularization has the effect of soft-thresholding the Fourier coefficients, so that those with small magnitude are set to 0, thus encouraging sparse datamodels. Furthermore, Equation (4) gives a simple expression for the residual of the best linear fit, which we utilize in our efficient residual-estimation procedure in Section 3.2. The proof of the full result is presented in Appendix B.1; here we present a proof sketch of result (a) to highlight the role of the Fourier basis and coefficients.

Proof sketch for Theorem 2.2(a). Since {ϕ_S}_{S⊆[N]} is an orthonormal basis for the inner product space with ⟨f, g⟩_{B_p} = E_{x∼B_p}[f(x)g(x)], we can rewrite R(θ) as

R(θ) = E_{x∼B_p}[(f(x) − θ_0 − Σ_i θ_i x̄_i)²] = E_{x∼B_p}[(f(x) − θ̃_0 − Σ_i θ̃_i ϕ_{i}(x))²] = ∥f − θ̃_0 − Σ_i θ̃_i ϕ_{i}∥²_{B_p},

where ∥·∥²_{B_p} := ⟨·, ·⟩_{B_p}, θ̃_i = (σ/2) θ_i, and θ̃_0 = θ_0 + ((µ+1)/2) Σ_i θ_i. By the orthonormality of {ϕ_S}_{S⊆[N]}, the minimizer θ̃* corresponds to the projection of f onto the span of {ϕ_S}_{|S|≤1}, which gives θ̃_i = f_{i} and thus θ_i = (2/σ) f_{i}. Furthermore, the residual is the squared norm of the projection of f onto the orthogonal subspace span{ϕ_S}_{|S|≥2}, which is precisely Σ_{|S|≥2} f_S².
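The characterization in Theorem 2.2(a) can be checked numerically by solving the weighted least-squares problem defining R(θ) exactly over B_p for a small N; the sketch below does this for a hypothetical nonlinear f (an illustration, not the paper's code).

```python
import itertools
import numpy as np

# Sketch (toy f, not the paper's setup): check Theorem 2.2(a) by solving the weighted
# least-squares problem defining R(theta) exactly over B_p for a small N.
N, p = 4, 0.6
mu, sigma = 2 * p - 1, np.sqrt(4 * p * (1 - p))

cube = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)   # all of {-1,+1}^N
w = np.prod(np.where(cube == 1, p, 1 - p), axis=1)                          # B_p probabilities
f_vals = np.tanh(cube @ np.array([0.5, -0.2, 0.1, 0.3])) ** 2               # some nonlinear f

xbar = (cube + 1) / 2                                                       # binary version of x
A = np.hstack([np.ones((len(cube), 1)), xbar])                              # [1, xbar_1, ..., xbar_N]
theta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * f_vals))         # argmin of R(theta)

phi1 = (cube - mu) / sigma                                                  # degree-1 basis functions
first_order = phi1.T @ (w * f_vals)                                         # Fourier coefficients f_{i}
assert np.allclose(theta[1:], 2 / sigma * first_order)                      # theta*_i = (2/sigma) f_{i}

residual = np.sum(w * (f_vals - A @ theta) ** 2)                            # R(theta*)
B0 = np.sum(w * f_vals) ** 2                                                # f_emptyset squared
B1 = np.sum(first_order ** 2)
assert np.isclose(residual, np.sum(w * f_vals ** 2) - B0 - B1)              # equals B_{>=2}
```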
3. NOISE STABILITY AND QUALITY OF LINEAR DATAMODELS

Theorem 2.2 characterizes the best linear datamodel from Ilyas et al. (2022) for any test function. The unanswered question is: why does this turn out to be a good surrogate (i.e., one with low residual error) for the actual function f? A priori, one expects f(x), the result of deep learning on x, to be a complicated function. In this section we use the idea of noise stability to provide an intuitive explanation. Furthermore, we show how noise stability can be leveraged to efficiently estimate the quality of fit of linear datamodels, without having to learn the datamodel (which, as noted in Ilyas et al. (2022), requires training a large number of nets on random training sets x).

Noise stability. Suppose x is p-biased as above. We define a ρ-correlated random variable as follows:

Definition 3.1 (ρ-correlated). For x ∈ {±1}^N, we say a random variable x′ is ρ-correlated to x if it is sampled as follows: if x_i = 1, then x′_i = −x_i with probability (1 − ρ)(1 − p); if x_i = −1, then x′_i = −x_i with probability (1 − ρ)p.

Note that P[x′_i = 1] = p(1 − (1 − ρ)(1 − p)) + (1 − p)(1 − ρ)p = p, so both x and x′ represent training datasets of expected size pN, with expected intersection a (p + ρ(1 − p)) fraction of pN. Define the noise stability of f at noise rate ρ as h(ρ) = E_{x,x′}[f(x)f(x′)], where x, x′ are ρ-correlated. Noise stability plays a role reminiscent of the moment generating function in probability, since orthonormality implies

h(ρ) = E_{x,x′}[f(x)f(x′)] = Σ_S f_S² ρ^{|S|} = Σ_{i=0}^N B_i ρ^i,    (7)

where B_i = Σ_{S:|S|=i} f_S². Thus h(ρ) is a polynomial in ρ whose coefficient of ρ^i captures the ℓ2 mass of the Fourier coefficients of sets S of size i.
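A minimal sketch (not the paper's code) of the ρ-correlated sampling of Definition 3.1, together with quick checks that x′ remains p-biased and that the expected overlap behaves as described above; in practice each call to f would correspond to training a model on the subset encoded by x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch (not the paper's code): sample a rho-correlated pair (x, x') per Definition 3.1.
def sample_p_biased(N, p):
    return np.where(rng.random(N) < p, 1, -1)

def rho_correlated(x, p, rho):
    # flip a +1 coordinate w.p. (1 - rho)(1 - p), a -1 coordinate w.p. (1 - rho) * p
    flip_prob = np.where(x == 1, (1 - rho) * (1 - p), (1 - rho) * p)
    return np.where(rng.random(len(x)) < flip_prob, -x, x)

N, p, rho, n = 100, 0.5, 0.2, 5000
frac_plus = overlap = 0.0
for _ in range(n):
    x = sample_p_biased(N, p)
    xp = rho_correlated(x, p, rho)
    frac_plus += (xp == 1).mean()
    overlap += np.sum((x == 1) & (xp == 1))

print(frac_plus / n)              # ~ p: the marginal of x' stays p-biased
print(overlap / n / (p * N))      # ~ p + rho*(1 - p): expected intersection as a fraction of pN
```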

3.1. WHY SHOULD LINEAR DATAMODELS BE A GOOD APPROXIMATION?

Suppose f = Σ_S f_S ϕ_S is the function of interest and let B_i = Σ_{S:|S|=i} f_S². Define the normalized noise stability as h̄(ρ) = E_{x,x′}[f(x)f(x′)] / E_x[f(x)²] for ρ ∈ [0, 1]. Intuitively, since f concerns test loss, one expects that as the number of training samples grows, the test behavior of two correlated datasets x, x′ will not be too different, since we can alternatively think of picking x, x′ as first randomly picking their intersection and then augmenting this common core with two disjoint random datasets. Thus, intuitively, the normalized noise stability should be high, and perhaps close to its maximum value of 1. If it is indeed close to 1, then the next theorem (a modification of a related theorem for boolean-valued functions in O'Donnell (2014)) gives a first-cut bound on the quality of the best linear approximation in terms of the magnitude of the residual error. (Note that linear approximation includes the case that f is constant.)

Theorem 3.1. The quality of the best linear approximation to f can be bounded in terms of the normalized noise stability h̄:

(Normalized residual)  Σ_{S:|S|≥2} f_S² / Σ_S f_S² ≤ (1 − h̄(ρ)) / (1 − ρ²).    (8)

Theorem 3.1 is well known in harmonic analysis and is in some sense the best possible estimate if all we have is the noise stability at a single (and small) value of ρ. In fact, we find in Figure 1c that the noise stability estimate does correlate strongly with the estimated quality of linear fit (based on our procedure from Section 3.2). Figure 3 plots noise stability estimates for small values of ρ. In standard machine learning settings, it is not any harder to estimate h(ρ) for more than one value of ρ, and this is used in the next section in a method to better estimate the quality of linear approximation.
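The bound of Theorem 3.1 can be sanity-checked on a toy function by exact enumeration (a sketch, not the paper's code):

```python
import itertools
import numpy as np

# Sketch (toy f, not the paper's code): check Theorem 3.1, B_{>=2}/B <= (1 - hbar(rho))/(1 - rho^2),
# exactly for a small N by enumerating the cube.
N, p, rho = 4, 0.6, 0.3
mu, sigma = 2 * p - 1, np.sqrt(4 * p * (1 - p))

cube = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)
w = np.prod(np.where(cube == 1, p, 1 - p), axis=1)
f_vals = np.tanh(cube @ np.array([0.5, -0.2, 0.1, 0.3])) ** 2

def coeff(S):
    phi = np.ones(len(cube)) if not S else np.prod((cube[:, list(S)] - mu) / sigma, axis=1)
    return np.sum(w * f_vals * phi)

B_k = {k: 0.0 for k in range(N + 1)}                       # Fourier mass per degree
for r in range(N + 1):
    for S in itertools.combinations(range(N), r):
        B_k[r] += coeff(S) ** 2

B = sum(B_k.values())
hbar = sum(B_k[k] * rho ** k for k in B_k) / B             # normalized noise stability (Eq. (7))
normalized_residual = sum(B_k[k] for k in B_k if k >= 2) / B
assert normalized_residual <= (1 - hbar) / (1 - rho ** 2) + 1e-12
```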

3.2. BETTER ESTIMATE OF QUALITY OF LINEAR APPROXIMATION

A simple way to test the quality of the best linear fit, as in Ilyas et al. (2022), is to learn a datamodel from samples and evaluate it on held-out samples. However, learning a linear datamodel requires solving a linear regression on N-dimensional inputs x̄, which could require O(N) samples in general. A sample here corresponds to training a neural net on a p-biased dataset x, and training O(N) such models can be expensive. (Ilyas et al. (2022) needed to train around a million models.) Instead, can we estimate the quality of the linear fit without having to learn the best linear datamodel? This question relates to the idea of property testing of boolean functions, and indeed our next result yields a better estimate by using noise stability at multiple points. The idea is to leverage Equation (7), where the Fourier mass at each set size appears in the noise stability h(ρ) = Σ_{i=0}^N B_i ρ^i as a non-negative coefficient of the polynomial in ρ. Since the residual of the best linear datamodel, from Theorem 2.2, is the total mass of Fourier coefficients of sets of size at least 2, i.e. B_{≥2} = Σ_{i=2}^N B_i, we can hope to learn it by fitting a polynomial to estimates of h(ρ) at multiple values of ρ. Algorithm 1 does precisely this: it first estimates the degree-0 and degree-1 coefficients (B_0 and B_1) using noise stability estimates at a few ρ's, then estimates B = Σ_{i=0}^N B_i = h(1) using ρ = 1, and finally estimates the residual via B_{≥2} = B − B_0 − B_1. The theorem below shows that this indeed leads to a good estimate of the residual while training far fewer (independent of N) models.

Theorem 3.2. Let B̂_{≥2} = RESIDUALESTIMATION(f, n, [0, ρ, 2ρ], 2) be the estimated residual (see Algorithm 1) after fitting a degree-2 polynomial to noise stability estimates at 0, ρ, 2ρ, using n calls to f. If n = O(1/ϵ³) and ρ = √ϵ, then with high probability |B̂_{≥2} − B_{≥2}| ≤ ϵ.

The proof is presented in Appendix B.2. This improves upon prior results on residual estimation for linear threshold functions from Matulef et al. (2010), which use a degree-1 approximation to h(ρ) and need 1/ϵ⁴ samples, compared to the 1/ϵ³ samples needed with our degree-2 approximation. In fact, we hypothesize that using a degree d > 2 would further improve the dependence on ϵ; we leave that for future work.

Algorithm 1 (RESIDUALESTIMATION(f, n, [ρ_1, ..., ρ_k], d)). For each i ∈ [k], estimate y_i ← NOISESTABILITY(f, n/(k+1), ρ_i) and set A_{i,:} ← [1, ρ_i, ..., ρ_i^d]; solve the least-squares problem min_z ∥Az − y∥ to obtain ẑ, whose first two entries estimate B_0 and B_1; estimate ĥ(1) ← NOISESTABILITY(f, n/(k+1), 1); and return B̂_{≥2} = ĥ(1) − ẑ_1 − ẑ_2. The subroutine NOISESTABILITY(f, n, ρ) averages f(x)f(x′) over n/2 independently sampled ρ-correlated pairs x ∼ B_p, x′ ∼ RHOCORR(x, ρ) (see Definition 3.1), using two evaluations of f per pair; it returns an unbiased estimate of h(ρ) (see Lemma B.1).

This result gives us a way to estimate the quality of the best linear datamodel without having to use O(N/ϵ²) samples[2] (in the worst case) for linear regression. The guarantee does not depend on N at all, although it has a worse dependence on ϵ that can likely be improved upon.

Experiments. We run our residual estimation algorithm for 1000 test examples (see Appendix C for details) and Figure 2 summarizes our findings. The histogram of the estimated normalized residuals (8) in Figure 2a indicates that a good linear fit exists for a majority of the points, echoing the findings of Ilyas et al. (2022). Figures 2b and 2c study the effects of choices such as the degree d and ρ. Furthermore, Figure 1b shows an interesting connection between the predicted quality of linear fit (1 − normalized residual) and the average margin of the test point: the linear fit is best when models trained on p-biased datasets are confidently right or wrong on average. The fit is much worse for examples that are closer to the decision boundary (smaller margin); exploring this finding is an interesting future direction. Finally, Figures 3 and 4 visualize the learned polynomial fits as the degree d and the list of ρ's are varied.
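The following Python sketch reconstructs the idea behind Algorithm 1 as described above (an illustration, not the authors' implementation); the stand-in f and all sample sizes are hypothetical, and in practice each call to f would mean training a model on the subset encoded by x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the residual-estimation idea behind Algorithm 1 (a reconstruction of the
# procedure described above, not the authors' code).
def sample_p_biased(N, p):
    return np.where(rng.random(N) < p, 1, -1)

def rho_correlated(x, p, rho):
    # Definition 3.1: flip +1 coords w.p. (1-rho)(1-p) and -1 coords w.p. (1-rho)*p
    flip_prob = np.where(x == 1, (1 - rho) * (1 - p), (1 - rho) * p)
    return np.where(rng.random(len(x)) < flip_prob, -x, x)

def noise_stability(f, N, p, rho, n_pairs):
    total = 0.0
    for _ in range(n_pairs):
        x = sample_p_biased(N, p)
        total += f(x) * f(rho_correlated(x, p, rho))   # two evaluations of f per pair
    return total / n_pairs                             # Monte Carlo estimate of h(rho)

def residual_estimation(f, N, p, rhos, d, n_pairs):
    y = np.array([noise_stability(f, N, p, r, n_pairs) for r in rhos])
    A = np.vander(np.asarray(rhos), d + 1, increasing=True)   # rows [1, rho_i, ..., rho_i^d]
    z, *_ = np.linalg.lstsq(A, y, rcond=None)                  # z estimates [B_0, B_1, ..., B_d]
    B_total = noise_stability(f, N, p, 1.0, n_pairs)           # h(1) = sum_S f_S^2
    return B_total - z[0] - z[1]                               # estimate of B_{>=2}

# Hypothetical stand-in for the trained-model test statistic.
f = lambda x: np.tanh(x[:20].mean() * 5)
print(residual_estimation(f, N=500, p=0.5, rhos=[0.0, 0.1, 0.2], d=2, n_pairs=2000))
```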

4. UNDERSTANDING GROUP INFLUENCE AND ABILITY FOR COUNTERFACTUAL REASONING

Both influence functions and linear datamodels display some ability to do counterfactual reasoning: to predict the effect of small changes to the training set on the model's behavior on test data. Specifically, they allow reasonable estimation of the difference between a model trained with x and one trained with x|_{I→−1}, the training set obtained from x by deleting the points in I. Averaged over x, this can be thought of as the average group influence of deleting I. We study this effect through the lens of Fourier coefficients in the next subsection.

4.1. EXPRESSION FOR GROUP INFLUENCE

Let I be a subset of variables and let x ∈ {−1, 1}^N denote a random p-biased variable with distribution B_p. Then the expectation of f(x) − f(x|_{I→−1}) can be thought of as the average influence of deleting I. An interesting empirical phenomenon in Koh et al. (2019) is that the group influence is often well correlated with the sum of individual influences over I, i.e. Σ_{i∈I} Inf_i(f). In fact, this is what makes empirical study of group influence feasible in various settings, because influential subsets I can be found without having to spend time exponential in |I| searching among all subsets of this size. The claim in Ilyas et al. (2022) is that linear datamodels exhibit the same phenomenon: the sum of coefficients of coordinates in I approximates the effect of training on x|_{I→−1}. We give the mathematics of such "counterfactual reasoning" as well as its potential limits.

Theorem 4.1 (Group influence). The following are true about the group influence of deletion:

Inf_I(f) := E_{x∼B_p}[f(x) − f(x|_{I→−1})]
 = p Σ_{i∈I} θ*_i − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′}
 = p Σ_{i∈I} Inf_i(f) − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′},

where θ* is the optimal datamodel (Equation (3)) and Inf_i is the individual influence (Equation (1)).

Thus we see that the average group influence of deletion[3] equals p times the sum of individual influences, up to a residual term. The proof of this theorem is presented in Appendix B.3. The residual term can in turn be upper bounded by (1 − p)^{−|I|/2} √(B_{≥2}) (see Lemma B.2), which blows up exponentially in |I|. However, the findings in Figure F.1 of Ilyas et al. (2022) suggest that the effect of the number of deleted points is linear (or sub-linear), far from exponential growth. In the next section we provide a simple example where the exponential blow-up is unavoidable, and also provide some hypotheses and hints as to why this exponential blow-up is not observed in practice.
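The identity of Theorem 4.1 can be verified numerically on a toy function by exact enumeration (a sketch, not the paper's code):

```python
import itertools
import numpy as np

# Sketch (toy f, not the paper's code): check the group-influence identity of Theorem 4.1
# by exact enumeration for a small N.
N, p = 5, 0.6
I = (0, 1, 3)                                              # the deleted group
mu, sigma = 2 * p - 1, np.sqrt(4 * p * (1 - p))
vec = np.array([0.4, -0.3, 0.2, 0.5, -0.1])

cube = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)
w = np.prod(np.where(cube == 1, p, 1 - p), axis=1)

def f(X):
    return np.tanh(X @ vec) ** 2                           # stand-in for the trained-model statistic

def coeff(S):
    phi = np.ones(len(cube)) if not S else np.prod((cube[:, list(S)] - mu) / sigma, axis=1)
    return np.sum(w * f(cube) * phi)

x_del = cube.copy()
x_del[:, list(I)] = -1.0
group_influence = np.sum(w * (f(cube) - f(x_del)))         # E_x[f(x) - f(x|_{I->-1})]

linear_part = p * sum(2 / sigma * coeff((i,)) for i in I)  # p * sum of individual influences
correction = sum((-1) ** len(Ip) * (p / (1 - p)) ** (len(Ip) / 2) * coeff(Ip)
                 for r in range(2, len(I) + 1) for Ip in itertools.combinations(I, r))
assert np.isclose(group_influence, linear_part - correction)
```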

4.2. THRESHOLD FUNCTION AND EXPONENTIAL GROUP INFLUENCE

We exhibit the above exponential dependence on set size using a natural function inspired by empirics around the long-tail phenomenon (Feldman, 2020; Feldman & Zhang, 2020), which suggest that successful classification of a specific test point depends on having seen a certain number of "closely related" points during training. We model this as a sigmoidal function that depends on a special set A of coordinates (i.e., training points) and assigns probability close to 1 when appreciably more than a β fraction of the coordinates in A are +1.

For any vector z ∈ {±1}^d, let avg(z) = (1/d) Σ_{i=1}^d (1 + z_i)/2 denote the fraction of +1's in z, and let z_A = (z_i)_{i∈A} denote the subset of coordinates indexed by A ⊆ [d].

Example 1 (Sigmoid). Consider the function f(x) = S(α(avg(x_A) − β)) for a subset A ⊆ [N] of size M, where S(u) = (1 + e^{−u})^{−1} is the sigmoid function. The function f could represent the probability of the correct label for a test point when trained on the training set x ∈ {±1}^N.

For this function we show below that, for large enough p, the group influence of a γ-fraction subset of A is exponential in the size of the subset.

Lemma 4.2. Consider the function[4] f from Example 1 with β = 0.5 and α → ∞, so that f(x) = 1{avg(x_A) > 0.5}. If x is p-biased with p > 0.5, then for any constant-fraction subset A′ ⊆ A, its group influence is exponentially larger than the sum of individual influences:

Inf_{A′}(f) ≥ ((2(1 − p))^{−|A′|+1} / |A′|) Σ_{i∈A′} Inf_i(f).    (9)

Thus, in the above example, when p > 0.5 the group influence can be exponentially larger than what is captured by the individual influences. The proof of this lemma is presented in Appendix B.3.

Margin vs. probability. The function f from Lemma 4.2 has the property that it saturates to 1 quickly, so once p is large enough the individual influence of deleting one point can be extremely small, yet the group influence of deleting sufficiently many of the influential points can switch the probability to close to 0. Ilyas et al. (2022), however, do not fit a datamodel to the probability; instead they fit it to the "margin", which is the log of the ratio of the probability of the correct label to the highest probability assigned to another label. Presumably this choice was dictated by better empirical results, and now we see that it may play an important role in the observed linearity of group influence. Specifically, their margin does not saturate and can get arbitrarily large as the probability of the correct label approaches 1. This can be seen by considering a slightly different (but related) notion of margin, defined as f̃ = log(f/(1 − f)), where the denominator is the total probability assigned to the other labels instead of the maximum among them. In this case, the following result shows that the group influence is adequately represented by the individual influences.

Corollary 4.3. For a function f from Example 1, the group influence of any set A′ ⊆ [N] on the margin function f̃ satisfies Inf_{A′}(f̃) = p Σ_{i∈A′} Inf_i(f̃).

This result follows directly by observing that the margin function is simply the inverse of the sigmoid function, so its expression is f̃(x) = α(avg(x_A) − β), which is a linear function. Since all Fourier coefficients of sets of size 2 or larger are zero for a linear function, the result follows from Theorem 4.1 with the residual term equal to 0.
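A small numerical illustration of Lemma 4.2 (a sketch with arbitrary choices of M, p, and subset sizes, not the paper's code): for the threshold function with p > 0.5, the ratio of the group influence of deleting k points of A to p times the sum of their individual influences grows rapidly with k.

```python
from math import comb

# Sketch (arbitrary M, p; not the paper's code): the threshold example of Lemma 4.2,
# f(x) = 1{avg(x_A) > 1/2}, with exact binomial tail computations.
def pmf(n, q, k):
    return comb(n, k) * q ** k * (1 - q) ** (n - k)

def tail(n, q, k):
    """P[Binomial(n, q) > k]."""
    return sum(pmf(n, q, j) for j in range(k + 1, n + 1))

M, p = 50, 0.7                                       # |A| (even) and bias; beta = 0.5
E_f = tail(M, p, M // 2)                             # E[f(x)] = P[#(+1)'s in A > M/2]
inf_single = pmf(M - 1, p, M // 2)                   # Inf_i(f) for i in A

for k in [5, 10, 15, 20]:                            # |A'| = number of deleted points of A
    group = E_f - tail(M - k, p, M // 2)             # E[f(x) - f(x|_{A'->-1})]
    linear = p * k * inf_single                      # p * sum of individual influences
    print(k, group / linear)                         # ratio grows rapidly with k
```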

5. CONCLUSION

This paper has shown how harmonic analysis can shed new light on interesting phenomena around influence functions. Our ideas use a fairly black-box view of deep nets, which helps bypass the current incomplete mathematical understanding of deep learning. Our new algorithm of Section 3.2 has the seemingly paradoxical property of being able to estimate the quality of fit of datamodels without actually having to train (at great expense) the datamodels. We hope this will motivate other algorithms in this space. One limitation of harmonic analysis is that an arbitrary f(x) could in principle behave very differently on p-biased distributions at p = 0.5 versus p = 0.6. But in deep learning, the trained net probably does not change much upon increasing the training set by 20%. Mathematically capturing this stability over p (not to be confused with the noise stability of Equation (7)) via a mix of theory and experiments promises to be very fruitful. It may also lead to a new and more fine-grained generalization theory that accounts for the empirically observed long-tail phenomenon. Finally, harmonic analysis is often the tool of choice for studying phase transition phenomena in random systems, and perhaps could prove useful for studying emergent phenomena in deep learning with increasing model sizes.

REFERENCES

Rui Zhang and Shihua Zhang. Rethinking influence functions of neural networks in the over-parameterized regime. AAAI, 2022.

Jianlong Zhou, Zhidong Li, Huaiwen Hu, Kun Yu, Fang Chen, Zelin Li, and Yang Wang. Effects of influence on user trust in predictive decision making. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-6, 2019.

A ADDITIONAL BACKGROUND

A.1 HARMONIC ANALYSIS AND BOOLEAN FUNCTIONS

We provide some more details on the classical harmonic analysis on the boolean hypercube. Please refer to O'Donnell (2014) for an extensive treatment of this topic. The set of all functions f : {±1}^N → R can trivially be viewed as a vector space, since it is closed under linear combinations. Further, for any distribution D over the hypercube {±1}^N, the set of real-valued functions can be viewed as an inner product space, with inner product ⟨f, g⟩_D = E_{x∼D}[f(x)g(x)]. An alternative way to view this inner product is to define a 2^N-dimensional vector f̄ ∈ R^{2^N} for every function f, with entries f̄_x = √(D(x)) f(x) for x ∈ {±1}^N. Under this parameterization, the aforementioned inner product between f and g is the standard dot product between f̄ and ḡ:

f̄^⊤ ḡ = Σ_{x∈{±1}^N} √(D(x)) f(x) √(D(x)) g(x) = Σ_{x∈{±1}^N} D(x) f(x) g(x) = E_{x∼D}[f(x)g(x)] = ⟨f, g⟩_D.

This inner product leads to the norm ∥f∥_D = √(E_{x∼D}[f(x)²]). Harmonic analysis involves identifying special orthonormal bases for this vector space. Studying the behavior of a function f at points that have roughly a p fraction of their coordinates equal to +1 is possible by considering the distribution of p-biased points x ∈ {±1}^N, where each coordinate of x is independently set to +1 with probability p; we denote this distribution by B_p. Properties of functions on the hypercube in this setting are best studied by decomposing them in an orthonormal basis for the inner product space defined by ⟨·, ·⟩_D. It turns out that the functions {ϕ_S : S ⊆ [N]} defined as ϕ_S(x) = ∏_{i∈S} (x_i − µ)/σ form the required orthonormal basis, where µ = 2p − 1 and σ² = 4p(1 − p) are the mean and variance of each coordinate of x. For orthonormality, we require that ⟨ϕ_S, ϕ_{S′}⟩_{B_p} = 1_{S=S′}. To see why this is true, note that

⟨ϕ_S, ϕ_{S′}⟩_{B_p} = E_x[∏_{i∈S} (x_i − µ)/σ · ∏_{i∈S′} (x_i − µ)/σ]    (11)
 = E_x[∏_{i∈S∩S′} ((x_i − µ)/σ)² · ∏_{i∈S∆S′} (x_i − µ)/σ]    (12)
 = ∏_{i∈S∩S′} E_{x_i}[((x_i − µ)/σ)²] · ∏_{i∈S∆S′} E_{x_i}[(x_i − µ)/σ]    (13)
 = ∏_{i∈S∆S′} E_{x_i}[(x_i − µ)/σ],    (14)

where S∆S′ denotes the symmetric difference of the sets. The second-to-last equality follows from the independence of the x_i's under B_p, while the last equality follows from the definition of the mean µ and variance σ². Thus if S = S′ then S∆S′ = ∅ and the inner product is 1, while if S ≠ S′ then |S∆S′| > 0 and the product is 0 by the definition of µ. Hence the ϕ_S's indeed form an orthonormal basis. Given an orthonormal basis, every function f : {±1}^N → R can be expressed as a linear combination of the basis elements, i.e. f = Σ_{S⊆[N]} f_S ϕ_S. The coefficients f_S are often referred to as the "Fourier" coefficients of f when the orthonormal basis is clear from context. The orthonormality of the ϕ_S's also implies Parseval's identity, Σ_S f_S² = ∥f∥²_{B_p}, since the ℓ2 norm measured in any orthonormal basis is the same.

A.2 LINEAR DATAMODELS

We provide some more background on linear datamodels from Ilyas et al. (2022) in the context of this paper; please refer to the original paper for more details. Consider a training set that is a subset of [N], denoted by x ∈ {±1}^N. The goal of datamodels is to be able to predict the outcome of training a neural network, using a fixed training algorithm, on a particular training set. Consider a function f(x) that denotes the result of training on x, where f can denote any function of the resulting network, e.g. test loss, loss on a particular test example, margin for a test example, or the norm of the parameters. Since the training algorithm involves randomness from various sources (initialization, SGD), f can be viewed as an average over this randomness for a given training set x. In general, f could be a very complicated function of the training set x; for instance, it could strongly depend on the presence or absence of particular subsets of datapoints. Furthermore, due to the black-box nature of trained networks, one would not expect there to be a simple description of the function f. Rather surprisingly, Ilyas et al. (2022) found that for the function f that denotes the margin for a single test example, f can on average be approximated by a linear function of the indicator of the training set, i.e. f(x) ≈ θ^⊤ x̄ for x ∼ B_p. One implication of this result is that the effect of adding or deleting a single point i ∈ [N] is independent of the rest of the training set and only depends on the size of the training set. A more important consequence is that θ can be estimated by evaluating f on many x's (which requires training a net for each such x) and then fitting a linear function θ̂ to those samples. This can be used not only to estimate the individual influence θ̂_i of each datapoint, but also to predict f on a new x by using the estimate θ̂^⊤ x̄ rather than actually training a net on x. Details on how θ̂ is learned using Lasso regression (to encourage sparsity) can be found in Ilyas et al. (2022). Using harmonic analysis, we show in Theorem 2.2 that the outcome of linear/ridge/Lasso regression provably equals the individual discrete influence of each point i (up to the shrinkage or soft-thresholding induced by the regularizer).
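A minimal sketch of this datamodel-fitting workflow (an illustration only: the synthetic f below stands in for the expensive train-and-evaluate step, sklearn's Lasso is used for the regression, and all hyperparameters are hypothetical rather than the authors' choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Minimal sketch of the datamodel-fitting workflow described above (not the authors' pipeline):
# sample p-biased training masks, evaluate f (a stand-in for "train a net and record the test
# margin"), and fit a sparse linear datamodel with Lasso.
N, p, n_models = 200, 0.5, 1000
masks = (rng.random((n_models, N)) < p).astype(float)       # binary version xbar of x

def f(xbar):
    # placeholder for the expensive step: train on the subset xbar and measure a test statistic
    signal = xbar[:, :10].sum(axis=1)
    return np.tanh(signal - 5) + 0.05 * rng.standard_normal(len(xbar))

y = f(masks)
datamodel = Lasso(alpha=0.01).fit(masks, y)                 # learns theta_1..theta_N and intercept
theta, theta0 = datamodel.coef_, datamodel.intercept_

# Predict f on fresh masks without "training" on them.
new_masks = (rng.random((100, N)) < p).astype(float)
pred = theta0 + new_masks @ theta
```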

B OMITTED PROOFS

B.1 PROOFS FOR SECTION 2

We recall Theorem 2.2:

Theorem 2.2 (Characterizing solutions to linear datamodels). Denote the quality of a linear datamodel θ ∈ R^{N+1} on the p-biased distribution over training sets B_p by

R(θ) := E_{x∼B_p}[(f(x) − θ_0 − Σ_{i=1}^N θ_i x̄_i)²],    (2)

where x̄ ∈ {0,1}^N is the binary version of x ∈ {±1}^N. Then, for µ = 2p − 1 and σ = 2√(p(1 − p)), the following are true about the optimal datamodels with and without regularization:

(a) The unregularized minimizer θ* = arg min_θ R(θ) satisfies

θ*_i = (2/σ) f_{i},    θ*_0 = f_∅ − ((µ + 1)/2) Σ_i θ*_i.    (3)

Furthermore, the residual error is the sum of squares of all Fourier coefficients of order 2 or higher:

R(θ*) = B_{≥2} := Σ_{S⊆[N]: |S|≥2} f_S².    (4)

(b) The minimizer with ℓ2 regularization, θ*(λ, ℓ2) = arg min_θ {R(θ) + λ∥θ_{1:N}∥²_2}, satisfies

θ*(λ, ℓ2)_i = (2/σ) (1 + 4λ/σ²)^{−1} f_{i}.    (5)

(c) The minimizer with ℓ1 regularization, θ*(λ, ℓ1) = arg min_θ {R(θ) + λ∥θ_{1:N}∥_1}, satisfies

θ*(λ, ℓ1)_i = (2/σ) [(f_{i} − λ/σ)_+ − (−f_{i} − λ/σ)_+] = (2/σ) sign(f_{i}) (|f_{i}| − λ/σ)_+,    (6)

where (z)_+ = z·1_{z>0} is the standard ReLU operation.

Proof. For all three parts we work in the Fourier basis corresponding to the distribution B_p, and the proof works for every p ∈ (0, 1). Since {ϕ_S}_{S⊆[N]} is an orthonormal basis for the inner product space with ⟨f, g⟩_{B_p} = E_{x∼B_p}[f(x)g(x)], we can rewrite R(θ) as follows:

R(θ) := E_{x∼B_p}[(f(x) − θ_0 − Σ_i θ_i x̄_i)²] =_(i) E_{x∼B_p}[(f(x) − θ̃_0 − Σ_i θ̃_i ϕ_{i}(x))²]    (15)
 =_(ii) ∥f − θ̃_0 − Σ_i θ̃_i ϕ_{i}∥²_{B_p},  where ∥g∥²_{B_p} := ⟨g, g⟩_{B_p} = E_{x∼B_p}[g(x)²],    (16)

and

θ̃_i = (σ/2) θ_i,    θ̃_0 = θ_0 + ((µ + 1)/2) Σ_i θ_i.    (17)

The first equality (i) follows by observing that x_i = 2x̄_i − 1 and that ϕ_{i}(x) = (x_i − µ)/σ. Step (ii) follows by applying the definition of ∥g∥_{B_p} to the function g = f − θ̃_0 − Σ_i θ̃_i ϕ_{i}. We can further simplify Equation (16) as follows:

R(θ) = ∥f − θ̃_0 − Σ_i θ̃_i ϕ_{i}∥²_{B_p} = ∥(f_∅ − θ̃_0) ϕ_∅ + Σ_{i=1}^N (f_{i} − θ̃_i) ϕ_{i} + Σ_{S:|S|≥2} f_S ϕ_S∥²_{B_p}    (18)
 =_(i) (θ̃_0 − f_∅)² + Σ_{i=1}^N (θ̃_i − f_{i})² + Σ_{S:|S|≥2} f_S²,    (19)

where (i) uses the orthonormality of {ϕ_S}_{S⊆[N]} and Parseval's theorem (O'Donnell, 2014). With this expression for R(θ), we are now ready to prove the main results.

Proof of (a): From Equation (19) it is evident that the minimizer of the unregularized objective satisfies θ̃*_0 = f_∅ and θ̃*_i = f_{i} for i ∈ [N]. Plugging this into Equation (17) yields the desired expressions for θ*. Furthermore, the residual at this optimal θ* is Σ_{S:|S|≥2} f_S².

Proof of (b): The ℓ2-regularized objective can be written as

R(θ) + λ∥θ_{1:N}∥²_2 = R(θ) + λ Σ_{i=1}^N θ_i² =_(i) R(θ) + (4λ/σ²) Σ_{i=1}^N θ̃_i²,

where (i) follows from Equation (17). Combining this with Equation (19), we observe that the minimization over each θ̃_i can be done independently. Thus

θ̃*_i = arg min_{θ̃_i} {(θ̃_i − f_{i})² + (4λ/σ²) θ̃_i²} = (1 + 4λ/σ²)^{−1} f_{i}.

Plugging this into Equation (17) gives the desired expression for θ*.

Proof for (c):

The ℓ1-regularized objective can be written as

R(θ) + λ∥θ_{1:N}∥_1 = R(θ) + λ Σ_{i=1}^N |θ_i| =_(i) R(θ) + (2λ/σ) Σ_{i=1}^N |θ̃_i|,

where (i) follows from Equation (17). Again, the minimization over each θ̃_i can be done independently:

θ̃*_i = arg min_{θ̃_i} {(θ̃_i − f_{i})² + (2λ/σ) |θ̃_i|} = (f_{i} − λ/σ)_+ − (−f_{i} − λ/σ)_+.

Plugging this into Equation (17) gives the desired expression for θ*.

B.2 PROOFS FOR SECTION 3

We recall Theorem 3.1:

Theorem 3.1. The quality of the best linear approximation to f can be bounded in terms of the normalized noise stability h̄:

Σ_{S:|S|≥2} f_S² / Σ_S f_S² ≤ (1 − h̄(ρ)) / (1 − ρ²).    (8)

Proof. We first note that h̄(ρ) = h(ρ)/h(1), since E[f(x)²] is the noise stability when x′ = x, which happens at ρ = 1. Let B_k = Σ_{S:|S|=k} f_S². From Equation (7) we get

h̄(ρ) = Σ_S ρ^{|S|} f_S² / Σ_S f_S² = Σ_k ρ^k B_k / Σ_k B_k ≤ (B_0 + B_1 + ρ² Σ_{k≥2} B_k) / Σ_k B_k.

Let B = Σ_S f_S² = Σ_k B_k and B_{≥2} = Σ_{S:|S|≥2} f_S² = Σ_{k≥2} B_k. Then we get

h̄(ρ) ≤ ((B − B_{≥2}) + ρ² B_{≥2}) / B = 1 − (1 − ρ²) B_{≥2}/B   ⟹   B_{≥2}/B ≤ (1 − h̄(ρ)) / (1 − ρ²).

This completes the proof.

Lemma B.1. For a function f, ρ ∈ [0, 1], and evaluation budget n, let ĥ(ρ) = NOISESTABILITY(f, n, ρ) be the noise stability estimate from Algorithm 1. If |f(x)| ≤ C for every x, then with probability at least 1 − η the error in the estimate can be upper bounded as

|ĥ(ρ) − h(ρ)| ≤ δ(n) = O(C² √(log(1/η)/n)).

Proof. The proof follows from a straightforward application of Hoeffding's inequality. Note that ĥ(ρ) = (2/n) Σ_{i=1}^{n/2} f(x^{(i)}) f(x′^{(i)}) is an average of n/2 i.i.d. variables, each lying in the range [−C², C²]. Additionally, since E[ĥ(ρ)] = h(ρ), Hoeffding's inequality gives

P[|ĥ(ρ) − h(ρ)| ≥ δ] ≤ 2 exp(−δ² n / (4C⁴)).

Setting δ = O(C² √(log(2/η)/n)) makes this probability at most η, thus completing the proof.

We now prove Theorem 3.2. Recall the statement:

Theorem 3.2. Let B̂_{≥2} = RESIDUALESTIMATION(f, n, [0, ρ, 2ρ], 2) be the estimated residual (see Algorithm 1) after fitting a degree-2 polynomial to noise stability estimates at 0, ρ, 2ρ, using n calls to f. If n = O(1/ϵ³) and ρ = √ϵ, then with high probability |B̂_{≥2} − B_{≥2}| ≤ ϵ.

Proof. Using standard concentration inequalities, Lemma B.1 shows that the estimates are close to the true values with high probability, i.e. y_i = ĥ(ρ_i) = h(ρ_i) + δ_i with |δ_i| ≤ δ, where δ := δ(n/k) = O(√(k/n)). Let ẑ be the solution found by Algorithm 1 and let z* = [B_j]_{j=0}^d be the "true" coefficients up to degree d. Since ẑ minimizes ∥Az − y∥ and z* is also a valid candidate,

∥Aẑ − y∥ ≤ ∥Az* − y∥.    (24)

Furthermore, we note the following about Az*:

(Az*)_i = Σ_{j=0}^d B_j ρ_i^j = h(ρ_i) − Σ_{j>d} B_j ρ_i^j ∈ [h(ρ_i) − B_{>d} ρ_i^{d+1}, h(ρ_i)].    (25)

We can upper bound ∥Az* − y∥ by observing that

|(Az*)_i − y_i| ≤ |δ_i| + B_{>d} ρ_i^{d+1} ≤ δ + B_{>d} ρ_i^{d+1},    (26)
∥Az* − y∥² ≤ k (δ + B_{>d} max_i ρ_i^{d+1})².    (27)

Using these, we measure the closeness of ẑ to z* as follows:

∥Az* − Aẑ∥² ≤ 2 (∥Az* − y∥² + ∥Aẑ − y∥²) ≤ 4 ∥Az* − y∥²,    (28)

where the first inequality follows from the triangle and Cauchy-Schwarz inequalities, and the last inequality follows from Equation (24). A naive upper bound for ∥z* − ẑ∥ is λ_min(A)^{−1} ∥Az* − Aẑ∥; however, this turns out to be quite loose. Since Algorithm 1 only returns ẑ_1 and ẑ_2, corresponding to estimates of B_0 and B_1, we only need to upper bound |ẑ_i − z*_i| for i ∈ [2]. We now analyze the special case k = 3, d = 2, ρ_1 = 0, ρ_2 = ρ, ρ_3 = 2ρ. Write ∆ = z* − ẑ. First note that (A∆)_1 = ∆_1, so |∆_1| ≤ ∥A∆∥. To upper bound ∆_2, decompose A∆ = ρ∆_2 v + Bu for an appropriate fixed vector v and matrix B (with u depending on the remaining coordinates of ∆); then

∥A∆∥ = ∥ρ∆_2 v + Bu∥ ≥ min_w ∥ρ∆_2 v + Bw∥ = ρ|∆_2| ∥(I − BB†)v∥ ≥ 0.39 ρ|∆_2|,    (30)

where the equality follows from the fact that the minimum ℓ2 norm is obtained as the residual after projecting onto the column span of B, and the last inequality follows from a numeric calculation of the norm. Thus |ẑ_1 − B_0| = |∆_1| ≤ ∥A∆∥ and |ẑ_2 − B_1| = |∆_2| ≤ ∥A∆∥/(0.39ρ). Combining Equations (27), (28) and (30), we get

|ẑ_1 − B_0|, |ẑ_2 − B_1| ≤ O((δ + B_{≥3} ρ³)/ρ) = O(δ/ρ + B_{≥3} ρ²).    (31)

Picking the optimal value ρ = Θ((δ/B_{≥3})^{1/3}) = O(n^{−1/6} B_{≥3}^{−1/3}), and using the fact that B̂_{≥2} = ĥ(1) − ẑ_1 − ẑ_2 while B_{≥2} = h(1) − B_0 − B_1, we get

|B̂_{≥2} − B_{≥2}| = O(δ^{2/3} B_{≥3}^{1/3}) = Õ((B_{≥3}/n)^{1/3}).

Thus, to achieve an ϵ-approximation we need ρ = Θ(√ϵ) and n = Ω(B_{≥3}/ϵ³), which completes the proof.

B.3 PROOFS FOR SECTION 4

We recall Theorem 4.1:

Theorem 4.1 (Group influence). The following are true about the group influence of deletion:

Inf_I(f) := E_{x∼B_p}[f(x) − f(x|_{I→−1})]
 = p Σ_{i∈I} θ*_i − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′}
 = p Σ_{i∈I} Inf_i(f) − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′},

where θ* is the optimal datamodel (Equation (3)) and Inf_i is the individual influence (Equation (1)).

Proof. Recall that the function f can be decomposed in the Fourier basis as f(x) = Σ_{S⊆[N]} f_S ϕ_S(x) = Σ_{S⊆[N]} f_S ∏_{i∈S} (x_i − µ)/σ, where µ = 2p − 1 and σ = 2√(p(1 − p)). Setting x_I to −1 replaces each factor (x_i − µ)/σ with i ∈ I by (−1 − µ)/σ = −√(p/(1 − p)), so

f(x|_{I→−1}) = Σ_{S⊆[N]} f_S (−√(p/(1 − p)))^{|S∩I|} ϕ_{S\I}(x) =_(i) Σ_{S⊆[N]: S∩I=∅} ( Σ_{I′⊆I} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{S∪I′} ) ϕ_S(x),    (35)

where in step (i) we collect all terms that share the same ϕ_{S\I} by relabeling S ← S\I and I′ ← S∩I. This proves the first part of the result. For the second part, we first note that E_x[f(x)] = f_∅. Secondly, the only term that survives in E_x[f(x|_{I→−1})] is the constant term (i.e. the coefficient of the basis function ϕ_∅(x)); from Equation (35), these are the terms with S = ∅. So

E_x[f(x) − f(x|_{I→−1})] = f_∅ − Σ_{I′⊆I} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′}    (36)
 =_(i) Σ_{i∈I} √(p/(1 − p)) f_{i} − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′}    (37)
 = p Σ_{i∈I} (2/σ) f_{i} − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′}    (38)
 =_(ii) p Σ_{i∈I} θ*_i − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′}    (39)
 =_(iii) p Σ_{i∈I} Inf_i(f) − Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′},

where (i) follows by separating out the I′ of size 0 and size 1, (ii) follows from Theorem 2.2, and (iii) from Proposition 2.1.

We now show an upper bound on the residual term from the previous result in terms of the residual of the best linear datamodel.

Lemma B.2. The residual term of group influence minus p times the sum of individual influences from Theorem 4.1 can be upper bounded as follows:

|Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′}| < (1 − p)^{−|I|/2} √(B_{≥2}),

where B_{≥2} is the residual of the best linear datamodel as defined in Equation (4).

Proof. We use the Cauchy-Schwarz inequality to upper bound this residual:

(Σ_{I′⊆I, |I′|≥2} (−1)^{|I′|} (p/(1 − p))^{|I′|/2} f_{I′})² ≤ (Σ_{I′⊆I, |I′|≥2} (p/(1 − p))^{|I′|}) (Σ_{I′⊆I, |I′|≥2} f_{I′}²)    (42)
 < (Σ_{I′⊆I} (p/(1 − p))^{|I′|}) (Σ_{I′⊆[N], |I′|≥2} f_{I′}²)    (43)
 = (1 + p/(1 − p))^{|I|} B_{≥2} = (1 − p)^{−|I|} B_{≥2}.

Taking square roots completes the proof.



Footnotes:
[1] The average is not over the test error but over the difference between the correct-label logit and the top logit among the rest, at the output layer of the deep net.
[2] If the datamodel is sparse, lasso can learn it with O(S log N/ϵ²) samples, where S is the sparsity.
[3] Similar results can be shown for the average group influence of adding a set of points I.
[4] The result can potentially be shown for more general α, β, p.
[5] The result can potentially be shown for more general α, β, p.



Figure 1: Scatter plots for 1000 test examples and their margin functions f (log of the ratio of the correct-label probability to the best probability for another label), for networks trained on p-biased sets x. (a) The normalized noise stability from Theorem 3.1 is high for many points; about 44% of points are above 0.9. Also, noise stability is low for points whose margin is close to 0, i.e. those close to the decision boundary. (b) Similarly, the estimated quality of linear fit (using Algorithm 1) is good when the model is confidently correct or wrong. (c) Noise stability for ρ = 0.05 correlates highly with the estimated quality of linear fit, lending credibility to Theorem 3.1.

Figure 2: (a) Histogram of the estimated normalized residual (B̂_{≥2}/B̂) for 1000 test examples, using Algorithm 1 with degree d = 2 and the list [0, 0.1, 0.2, 1] of ρ's. Almost half of the test examples have normalized residuals below 0.1. (b) Estimates using d = 2 and d = 3, for various test points, are highly correlated with each other, suggesting a small effect of the degree of approximation in this case. (c) Estimates using [0, ρ, 2ρ, 1] in Algorithm 1, for ρ = 0.05 and ρ = 0.1; the Spearman correlation is ≈ 0.8. The choice of ρ does have some effect, suggesting that there is still some noise in the estimate. Note that Theorem 3.2 requires the right scale of ρ for the estimate to be accurate.







We now prove the exponential blow-up in group influence from Lemma 4.2 (restated below).

Lemma 4.2. Consider the function[5] f from Example 1 with β = 0.5 and α → ∞, so that f(x) = 1{avg(x_A) > 0.5}. If x is p-biased with p > 0.5, then for any constant-fraction subset A′ ⊆ A, its group influence is exponentially larger than the sum of individual influences.

Proof. Let M = |A| and suppose |A′| = γ|A| for a small constant γ. We study Inf_{A′}(f) from its definition and for a general β ∈ (0, 1); the function of interest is f(x) = 1{avg(x_A) > β}. Let B(n, q) denote the binomial random variable given by the sum of n independent Bernoulli(q) variables. We first note that E_x[f(x)] = P[B(M, p) > βM]: f(x) = 1 exactly when more than a β fraction of the indices in A are +1, which is precisely the binomial tail probability. Similarly, E_x[f(x|_{A′→−1})] = P[B(M − |A′|, p) > βM], and the group influence is the difference of these two tail probabilities, which can be written as a sum over thresholds β′ of terms involving a ratio c_β(γ) of binomial coefficients. We study the quantity c_β(γ) in more detail. Using Stirling's approximation for binomial coefficients, log₂ C(n, qn) ≈ nH(q), where H(q) = −q log₂ q − (1 − q) log₂(1 − q) is the binary entropy function, gives a cleaner expression for log₂ c_β(γ). For small γ we use a linear approximation of log₂ c_β(γ) at γ = 0; the derivative is governed by a quantity g(β) that is a decreasing function of β. For β′ ≤ 0.5 and p > 1/2 we have g(β′) > 0, so c_{β′}(γ) is an increasing function of γ. For the ratio of group to individual influence, we consider the corresponding expression for each β′ ≤ β = 0.5; plugging the bound on c_{β′}(γ) back in shows that each term in the summation is bounded by an exponential factor in |A′|. Considering the total individual influence over A′, rather than the influence of a single point, completes the proof.

C EXPERIMENTAL SETUP

Experiments are conducted on the CIFAR-10 dataset to test the estimation procedure and the quality of the linear fit in Figures 1 and 2. We use the FFCV library (Leclerc et al., 2022) to train models on CIFAR-10; each model takes ∼30s to train on our GPUs. We first pick a subset of 10k images from the CIFAR-10 training dataset. Models are then trained on sets of size 5000 (corresponding to p = 0.5), which achieve an average of ∼71% accuracy on the CIFAR-10 test set.

For the noise stability estimates from Algorithm 1, we sample 600 pairs of ρ-correlated datasets (x, x′) each for ρ = 0.05, 0.1, and 0.2. We train 12,000 models for each setting of ρ: there are 600 distinct sets x plus 600 distinct ρ-correlated sets x′, and the remaining 10× runs come from running 10 random seeds per training set. We use the default ResNet-based architecture in FFCV with a batch size of 512, an initial learning rate of 0.5, 24 epochs, weight decay of 5e-4, and SGD with momentum as the optimizer.

For the experiment with residual estimation, we use [0, 0.1, 0.2, 1] as the list of ρ's. Although the theoretical analysis in Theorem 3.2 does not use ρ = 1 for the polynomial fitting, we find experimentally that adding ρ = 1 gives more robust estimates. This is evident from the polynomial fits obtained for 20 randomly selected test examples in Figures 3 and 4, where the fits from [0, 0.1, 0.2, 1] are clearly better than those from [0, 0.1, 0.2]. The theory can also be extended to this case, with a slightly modified analysis.

Figure 3: for each example (corresponding to a plot), we plot the estimated ĥ(ρ) in blue dots and the best degree d ∈ {1, 2, 3} (black, red and blue curves respectively) polynomial fits obtained from Algorithm 1 for the list of ρ's [0, 0.1, 0.2, 1]. The test examples are rearranged so that the first 4 rows contain points with min_ρ ĥ(ρ) > 0.7 and the y-axis range is set to [0.6, 1] for clarity of presentation; the y-axis range for the last row is [0, 1]. The estimated residual from Algorithm 1 is precisely equal to one minus the sum of the constant and degree-1 coefficients.

Figure 4: same as Figure 3, but with the polynomial fits obtained from Algorithm 1 for the list of ρ's [0, 0.1, 0.2].

