PROVABLY AUDITING ORDINARY LEAST SQUARES IN LOW DIMENSIONS

Abstract

Auditing the stability of a machine learning model to small changes in the training procedure is critical for engendering trust in practical applications. For example, a model should not be overly sensitive to removing a small fraction of its training data. However, algorithmically validating this property seems computationally challenging, even for the simplest of models: Ordinary Least Squares (OLS) linear regression. Concretely, recent work defines the stability of a regression as the minimum number of samples that need to be removed so that rerunning the analysis overturns the conclusion (Broderick et al., 2020), specifically meaning that the sign of a particular coefficient of the OLS regressor changes. But the only known approach for estimating this metric, besides the obvious exponential-time algorithm, is a greedy heuristic that may produce severe overestimates and therefore cannot certify stability. We show that stability can be efficiently certified in the low-dimensional regime: when the number of covariates is a constant but the number of samples is large, there are polynomial-time algorithms for estimating (a fractional version of) stability, with provable approximation guarantees. Applying our algorithms to the Boston Housing dataset, we exhibit regression analyses where our estimator outperforms the greedy heuristic, and can successfully certify stability even in the regime where a constant fraction of the samples are dropped.

1. INTRODUCTION

A key facet of interpretability of machine learning models is understanding how different subsets of the training data influence the learned model and its predictions. Computing the influences of individual training points has been shown to be a useful tool for enhancing trust in the model (Zhou et al., 2019), for tracing the origins of model bias (Brunet et al., 2019), and for identifying mislabelled training data and other model debugging (Koh & Liang, 2017). Modelling the influence of groups of training points has applications to measuring fairness (Chen et al., 2018), vulnerability to contamination of multi-source training data (Hayes & Ohrimenko, 2018), and (most relevant to this paper) identification of unstable predictions (Ilyas et al., 2022) and models (Broderick et al., 2020). In a high-stakes machine learning application, it would likely be alarming if some data points were so influential that the removal of, say, 1% of the training data dramatically changed the model. An ideal, trustworthy machine learning pipeline therefore should include validation that this does not happen. But the obvious algorithm for checking if a model trained on n data points exhibits this instability would require computing the group influences of $\binom{n}{n/100}$ different subsets of the data, which is computationally infeasible even for fairly small n. Instead, current methods for estimating the stability of a model simply use the first-order approximation of group influence: namely, the sum of individual influences of data points in the group. With this approximation, vulnerability of a model to dropping $\alpha n$ data points is heuristically estimated by dropping the $\alpha n$ most individually influential data points (Broderick et al., 2020; Ilyas et al., 2022). This heuristic can be thought of as using "local" stability as a proxy for "global" stability, and it has found substantial anecdotal success in diagnosing unstable models.
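As a concrete illustration, the first-order heuristic described above can be sketched in a few lines. This is a simplified one-shot version under our own naming, not the cited works' code: each sample is scored by its exact leave-one-out influence on the first OLS coefficient, and the top fraction is dropped.

```python
import numpy as np

def greedy_drop(X, y, alpha):
    """One-shot greedy heuristic: drop the alpha*n samples whose exact
    leave-one-out influence on the first OLS coefficient is largest."""
    n = len(y)
    H = np.linalg.inv(X.T @ X)
    beta = H @ (X.T @ y)
    resid = y - X @ beta
    lev = np.einsum("ij,jk,ik->i", X, H, X)      # leverage scores h_i
    # Removing sample i changes beta by -H @ X_i * resid_i / (1 - h_i);
    # we score samples by the magnitude of the change in the first coordinate.
    infl = (X @ H[:, 0]) * resid / (1 - lev)
    drop = np.argsort(-np.abs(infl))[: int(alpha * n)]
    keep = np.setdiff1d(np.arange(n), drop)
    beta_drop = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return beta[0], beta_drop[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, 0.5]) + 0.5 * rng.normal(size=500)
b_full, b_drop = greedy_drop(X, y, 0.01)
```

If the sign of the first coefficient flips after dropping, the heuristic has exhibited a witness of instability; if it does not flip, nothing is certified, which is precisely the gap this paper addresses.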
Unfortunately, for correlated groups of data points, the first-order approximation of the group influence is often an underestimate (Koh et al., 2019), so large local stability does not actually certify that a model is provably stable to removing small subsets of data. In fact, stability certification is a challenging and open problem even in the simplest of models: linear regression via Ordinary Least Squares (OLS). Concretely, given a regression dataset, a natural metric for the stability of the OLS regressor is the minimum number of data points that need to be removed from the dataset to flip the sign of a particular coefficient of the regressor (e.g., in causal inference, the coefficient measuring the treatment effect). Recent work has used the local stability heuristic to diagnose unstable OLS regressions in several prominent economics studies (Broderick et al., 2020), identifying examples where even a statistically significant conclusion can be overturned by removing less than 1% of the data points. However, the converse question of validating stable conclusions remains unaddressed: Given a regression dataset, can we efficiently certify non-trivial lower bounds on the stability of the OLS regressor? Our work takes steps towards addressing this question, via the following contributions:

• We introduce a natural fractional relaxation of the above notion of OLS stability, where we allow removing fractions of data points, and seek to minimize the total removed weight. We call this finite-sample stability, and henceforth refer to the prior notion as "integral" stability.

• We develop approximation algorithms for the finite-sample stability, with (a) provable guarantees under reasonable anti-concentration assumptions on the dataset, and (b) running time polynomial in the size of the dataset, so long as the dimension of the data is a constant (in contrast, the naive algorithm is exponential in the size of the dataset). Moreover, we prove that (at least for exact algorithms) exponential dependence of the running time on the dimension is unavoidable under standard complexity assumptions.

• We use modifications of our algorithms to compute assumption-free upper and lower bounds on the finite-sample stability of several simple synthetic and real datasets, achieving tighter upper bounds than prior work and the first non-trivial lower bounds, i.e. certifications that the OLS regressor is stable.

Why define stability this way? The definition of integral stability was introduced in (Broderick et al., 2020), along with several variants (e.g. the smallest perturbation which causes the first coordinate to lose significance). We choose the definition based on the sign of the first coordinate because it has a clear practical interpretation (does the first covariate positively or negatively affect the response?), which does not depend on the choice of additional parameters such as a significance level. We study the fractional relaxation so that the stability is defined by a continuous optimization problem. Note that certifying a lower bound on the fractional stability immediately certifies a lower bound on the integral stability; we will see later (Remark 3.1) that a near-converse holds in low dimensions.

Why is low-dimensional regression important? Given that much of machine learning happens in high-dimensional settings, where the number of covariates can even be larger than the number of datapoints, it is natural to wonder why low-dimensional settings are still important. First, in application areas such as econometrics, linear regressions with as few as two to four covariates are very common (Britto et al., 2022; Bianchi & Bigio, 2022; Hopenhayn et al., 2022), often serving as proofs-of-concept for more complex models. Second, even in settings where the number of covariates is larger, it is often the expectation that few covariates are relevant.
In such applications, analysis often consists of a variable selection step followed by regression on a much-reduced set of covariates (Cai & Wang, 2011). In all these settings, understanding the stability of an estimator is important, and our work gives some of the first provable guarantees that avoid making strong distributional assumptions. Moreover, our lower bounds show that certifying stability of truly high-dimensional models, even linear ones, is intractable.

1.1. FORMAL PROBLEM STATEMENT

We are given a deterministic and arbitrary set of n samples $(X_i, y_i)_{i=1}^n$, where each $X_i$ is a vector of d real-valued covariates, and each $y_i$ is a real-valued response. We are interested in a single coefficient of the OLS regressor (without loss of generality, the first coordinate): in an application, the first covariate may be the treatment and the rest may be controls. The sign of this coefficient is important because it estimates whether the treatment has a positive or negative effect. Thus, we want to determine if it can be changed by dropping a few samples from the regression. Formally, we consider the fractional relaxation, where we allow dropping fractions of samples:

Definition 1.1. Fix $(X_i, y_i)_{i=1}^n$ with $X_1, \ldots, X_n \in \mathbb{R}^d$ and $y_1, \ldots, y_n \in \mathbb{R}$. For any $w \in [0,1]^n$, the weight-$w$ OLS solution set of $(X_i, y_i)_{i=1}^n$ is
\[
\mathrm{OLS}(X, y, w) := \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n w_i \left(\langle X_i, \beta\rangle - y_i\right)^2.
\]
The finite-sample stability of $(X_i, y_i)_{i=1}^n$ is
\[
\mathrm{Stability}(X, y) := \inf_{w \in [0,1]^n,\ \beta \in \mathbb{R}^d} \left\{\, n - \|w\|_1 : \beta_1 = 0 \text{ and } \beta \in \mathrm{OLS}(X, y, w) \,\right\}.
\]

This is the minimum number of samples (in a fractional sense) which need to be removed to zero out the first coordinate of the OLS regressor. If the OLS solution set contains multiple regressors, then it suffices if any regressor $\beta$ in the solution set has $\beta_1 = 0$. Our algorithmic goal is to compute $\mathrm{Stability}(X, y)$, or at least to approximate $\mathrm{Stability}(X, y)$ up to an additive $\epsilon n$ error.
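To make Definition 1.1 concrete, here is a minimal numerical sketch (our own illustration, not the paper's code) of the weight-$w$ OLS solution and the first-order condition that characterizes it:

```python
import numpy as np

def weighted_ols(X, y, w):
    # Any beta in OLS(X, y, w) satisfies the weighted normal equations
    #   X^T (w * (X @ beta - y)) = 0,  i.e.  (X^T diag(w) X) beta = X^T diag(w) y.
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=n)

w = rng.uniform(0.5, 1.0, size=n)   # fractional weights, as in the relaxation
beta = weighted_ols(X, y, w)
grad = X.T @ (w * (X @ beta - y))   # first-order condition, should be ~0
```

Dropping sample i corresponds to setting $w_i = 0$; the relaxation additionally allows any intermediate weight.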

1.2. RESULTS

By brute-force search, the (integral) stability can be computed in time $2^n \cdot \mathrm{poly}(n)$. However, because the complexity is exponential in the number of samples, it is computationally infeasible even when the dimension d of the data is low, which is a common situation in many scientific applications. Similarly, the fractional stability (Definition 1.1) is the solution to a non-convex optimization problem in more than n variables, which seems no simpler. Can we still hope for a polynomial-time algorithm in constant dimensions? We show that the answer is yes.

Theorem 1.2. There is an $n^{O(d^3)}$-time algorithm which, given n arbitrary samples $(X_i, y_i)_{i=1}^n$ with $X_1, \ldots, X_n \in \mathbb{R}^d$ and $y_1, \ldots, y_n \in \mathbb{R}$, and given $k \ge 0$, decides whether $\mathrm{Stability}(X, y) \le k$.

We also show that the exponential dependence on dimension d is necessary under standard complexity assumptions:

Theorem 1.3. Under the Exponential Time Hypothesis, there is no $n^{o(d)}$-time algorithm which, given $(X_i, y_i)_{i=1}^n$ and $k \ge 0$, decides whether $\mathrm{Stability}(X, y) \le k$.

This theorem in particular rules out fixed-parameter tractability, i.e. algorithms with time complexity $f(d) \cdot \mathrm{poly}(n)$. However, it only applies to exact algorithms. In practice, it is unlikely to matter whether $\mathrm{Stability}(X, y) = 0.01n$ or $\mathrm{Stability}(X, y) = 0.02n$; in both cases, the conclusion is sensitive to dropping a very small fraction of the data. This motivates our next two algorithmic results on $\epsilon n$-additive approximation of the stability (where we think of $\epsilon > 0$ as a constant). First, we make a mild anti-concentration assumption, under which the stability can be $\epsilon n$-approximated in time roughly $n^{d+O(1)}$. While still not fixed-parameter tractable, this algorithm can now be run on moderate-sized problems in low dimensions, unlike the algorithm in Theorem 1.2.

Assumption A. Let $\epsilon, \delta > 0$.
We say that samples $(X_i, y_i)_{i=1}^n$ satisfy $(\epsilon, \delta)$-anti-concentration if for every $\beta \in \mathbb{R}^d$, it holds that
\[
\#\left\{ i \in [n] : |\langle X_i, \beta\rangle - y_i| < \frac{\delta}{\sqrt{n}} \left\|X\beta^{(0)} - y\right\|_2 \right\} \le \epsilon n,
\]
where $X : n \times d$ is the matrix with rows $X_1, \ldots, X_n$, and $\beta^{(0)} \in \mathrm{OLS}(X, y, \mathbf{1})$ is any unweighted OLS regressor of y against X.

See Appendix F.1 for discussion of when Assumption A holds. Under this assumption, we present an $O(\epsilon n)$-approximation algorithm:

Theorem 1.4. For any $\epsilon, \delta, \eta > 0$, there is an algorithm PARTITIONANDAPPROX with time complexity
\[
n + \left( \frac{Cd}{\epsilon^2} \log\frac{1}{\delta} \log\frac{1}{\epsilon\eta} \right)^{d+O(1)}
\]
which, given $\epsilon, \delta, \eta$, and samples $(X_i, y_i)_{i=1}^n$ satisfying $(\epsilon, \delta)$-anti-concentration, returns an estimate $\hat{S}$ such that with probability at least $1 - \eta$, $|\hat{S} - \mathrm{Stability}(X, y)| \le 12\epsilon n + 1$.

In fact, PARTITIONANDAPPROX also can detect when Assumption A fails (see Theorem D.6 for a precise statement), so it can be used to compute unconditional lower bounds on stability with high probability (where the lower bound is provably tight if the data satisfies anti-concentration). Moreover, as discussed in Appendix F.1, the required anti-concentration is very mild. If $\epsilon, \eta > 0$ are constants, the algorithm has time complexity $n^{d+O(1)}$, so long as the samples satisfy $(\epsilon, \exp(-\Omega(n)))$-anti-concentration. This is true for arbitrary smoothed data. Finally, unlike the exact algorithm, PARTITIONANDAPPROX avoids heavy algorithmic machinery; it only requires solving linear programs.

Fixed-parameter tractability? Our final result is that $\epsilon n$-approximation of the stability is in fact fixed-parameter tractable, under a stronger anti-concentration assumption.

Assumption B. Let $\epsilon, \delta > 0$. We say that samples $(X_i, y_i)_{i=1}^n$ satisfy $(\epsilon, \delta)$-strong anti-concentration if for every $\beta \in \mathbb{R}^{d+1}$, it holds that
\[
\#\left\{ i \in [n] : |\langle X_i, \beta\rangle| < \frac{\delta}{\sqrt{n}} \|X\beta\|_2 \right\} \le \epsilon n,
\]
where (abusing notation) $X : n \times (d+1)$ is the matrix with columns $(X^T)_1, \ldots, (X^T)_d, y$. This assumption holds with constant $\epsilon, \delta > 0$ under certain distributional assumptions on $(X_i, y_i)_{i=1}^n$, e.g.
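Assumption A quantifies over all $\beta$, so it cannot be verified exhaustively; the paper's algorithms handle this rigorously, but a Monte Carlo spot-check over sampled directions gives a quick sanity test. The sketch below is our own illustration (function names and the test distribution are assumptions, not from the paper):

```python
import numpy as np

def small_residual_fraction(X, y, beta, delta):
    # Fraction of samples whose residual under beta falls below the
    # Assumption-A threshold (delta / sqrt(n)) * ||X beta0 - y||_2.
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
    thresh = delta / np.sqrt(len(y)) * np.linalg.norm(X @ beta0 - y)
    return np.mean(np.abs(X @ beta - y) < thresh)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.0, 1.0]) + rng.normal(size=1000)
# Spot-check over 50 random directions; a large value for any beta
# would witness a violation of (eps, delta)-anti-concentration.
worst = max(small_residual_fraction(X, y, b, 0.1)
            for b in rng.normal(size=(50, 2)))
```

For well-spread data such as this Gaussian example, the fraction stays small for every sampled direction, consistent with the discussion in Appendix F.1.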
centered Gaussian mixtures with uniformly bounded condition number (Appendix F.2).

Theorem 1.5. For any $\epsilon, \delta > 0$, there is a $(\sqrt{d}/(\epsilon\delta^2))^d \cdot \mathrm{poly}(n)$-time algorithm NETAPPROX which, given $\epsilon, \delta$, and samples $(X_i, y_i)_{i=1}^n$ satisfying $(\epsilon, \delta)$-strong anti-concentration, returns an estimate $\hat{S}$ satisfying $\mathrm{Stability}(X, y) \le \hat{S} \le \mathrm{Stability}(X, y) + 3\epsilon n + 1$. Moreover, $\mathrm{Stability}(X, y) \le \hat{S}$ holds for arbitrary $(X_i, y_i)_{i=1}^n$.

Extensions. Another model, frequently used in causal inference and econometrics, is instrumental variables (IV) linear regression. When the noise $\eta$ in a hypothesized causal relationship $y = \langle X, \beta^*\rangle + \eta$ is believed to be endogenous (i.e. correlated with X), a common approach (Sargan, 1958; Angrist et al., 1996; Card, 2001) is to find a p-dimensional variable Z (the instrument) for which domain knowledge suggests that $\mathbb{E}[\eta \mid Z] = 0$. Positing that $\beta^*$ is identified by the moment condition $\mathbb{E}[Z(y - \langle X, \beta\rangle)] = 0$, the weight-$w$ IV estimator set given samples $(X_i, y_i, Z_i)_{i=1}^n$ is then
\[
\mathrm{IV}(X, y, Z, w) = \{\beta \in \mathbb{R}^d : Z^T (w \star (X\beta - y)) = 0\},
\]
where $a \star b$ denotes the elementwise product, and $Z : n \times p$ and $X : n \times d$ are the matrices of instruments and covariates respectively. Stability can be defined as in Definition 1.1. Although for simplicity we state all of our results for OLS (i.e. the special case $Z = X$), it can be seen that Theorem 1.2 and Theorem 1.5 both extend directly to the IV regression setting. See Appendix G for further discussion.

Experiments. We implement modifications of NETAPPROX and PARTITIONANDAPPROX which give unconditional, exact upper and lower bounds on stability, respectively. We use these algorithms to obtain tight data-dependent bounds on stability of isotropic Gaussian datasets for a broad range of signal-to-noise ratios, and we demonstrate heterogeneous synthetic datasets where our algorithms' upper bounds are an order of magnitude better than upper bounds obtained by the prior heuristic.
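In the just-identified case p = d, the weighted IV estimator solves a single linear system, and with Z = X it reduces exactly to weighted OLS. A minimal sketch under our own naming:

```python
import numpy as np

def iv_estimator(X, y, Z, w=None):
    # Solves Z^T (w * (X beta - y)) = 0 for beta (just-identified: p = d).
    if w is None:
        w = np.ones(len(y))
    Zw = Z * w[:, None]
    return np.linalg.solve(Zw.T @ X, Zw.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([0.7, -0.3]) + 0.2 * rng.normal(size=300)

beta_iv = iv_estimator(X, y, X)                 # Z = X recovers plain OLS
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

This is why the OLS results in the paper transfer to IV: the defining moment condition has the same bilinear structure in (w, β).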
On the Boston Housing dataset (Harrison Jr & Rubinfeld, 1978) , we regress house values against all pairs of features. For the majority of these regressions, we bound the stability within a factor of two. On the one hand, we detect many sensitive conclusions (including some which the greedy heuristic claims are stable); on the other hand, we certify that some conclusions are stable to dropping as much as half the dataset.

1.3. ORGANIZATION

In Section 2 we review related work. In Section 3 we collect notation and formulas that will be useful later. In Section 4 we sketch the intuition behind our algorithmic results. Section 5 covers our experiments. In Appendices B, C, D, and E we prove Theorems 1.2, 1.3, 1.4, and 1.5 respectively.

2. RELATED WORK

There is a rich literature on topics related to finite-sample stability, including sensitivity analysis and robustness to distribution shift and contamination. Due to space constraints, here we only discuss the works most relevant to ours, and we postpone broader discussion to Appendix A. Most directly related is the prior work on heuristics for the (integral) stability (Broderick et al., 2020; Kuschnig et al., 2021). The heuristic given by Broderick et al. (2020) (to approximate the most-influential k samples) is simply the local approximation: compute the local influence of each sample at $w = \mathbf{1}$, sort the samples from largest to smallest influence, and output the top k samples. Subsequent work (Kuschnig et al., 2021) refines this heuristic by recomputing the influences after removing each sample, which alleviates issues such as masking (Chatterjee & Hadi, 1986). But this is still just a greedy heuristic, and it may fail when samples are jointly but not individually influential. Except under the strong assumption that the sample covariance remains nearly constant when we remove any $\epsilon n$ samples (see Theorem 1 in Broderick et al. (2020), which relies on Condition 1 in Giordano et al. (2019)), the local influence approach can upper bound the finite-sample stability but cannot provably lower bound it. In fact, in Section 5 we provide examples where the greedy heuristic of Kuschnig et al. (2021) is very inaccurate due to instability in the sample covariance. Closely related to finite-sample stability, the s-value (Gupta & Rothenhäusler, 2021) is the minimum Kullback-Leibler divergence $D(P \,\|\, P_0)$ over all distributions P for which the conclusion is null, where $P_0$ is the empirical distribution of the samples. Unfortunately, while the s-value is an interesting and well-motivated metric, computing the s-value for OLS estimation appears to be computationally intractable, and the algorithms given by Gupta & Rothenhäusler (2021) lack provable guarantees.

3. PRELIMINARIES

For vectors $u, v \in \mathbb{R}^m$, we let $u \star v$ denote the elementwise product $(u \star v)_i = u_i v_i$. Throughout the paper, we will frequently use the closed-form expression for the (weighted) OLS solution set
\[
\mathrm{OLS}(X, y, w) = \{\beta \in \mathbb{R}^d : X^T (w \star (X\beta - y)) = 0\},
\]
where $X : n \times d$ is the matrix with rows $X_1, \ldots, X_n$. In particular, setting $\lambda = \beta_{2:d}$, this means that the finite-sample stability can be rewritten as
\[
\mathrm{Stability}(X, y) = \inf_{w \in [0,1]^n,\ \lambda \in \mathbb{R}^{d-1}} \left\{\, n - \|w\|_1 : X^T (w \star (\tilde{X}\lambda - y)) = 0 \,\right\}, \tag{1}
\]
where (here and throughout the paper) $\tilde{X} : n \times (d-1)$ is the matrix with columns $(X^T)_2, \ldots, (X^T)_d$.

Remark 3.1. As previously mentioned, the finite-sample stability always lower bounds the integral stability (the minimum number of samples that need to be removed to make the first coordinate of the regressor change sign), by continuity of the OLS solution set in w. Additionally, it can be seen from Equation (1) that a partial converse holds in low dimensions. For any feasible $(w, \lambda)$, the set of $w'$ such that (a) $(w', \lambda)$ is feasible, and (b) $\|w\|_1 = \|w'\|_1$, has the form $[0,1]^n \cap V$ for some affine subspace $V \subseteq \mathbb{R}^n$ of codimension at most $d+1$. Thus, there is some $w' \in [0,1]^n \cap V$ with at most $d+1$ non-integral weights. If $\mathrm{Stability}(X, y) = \alpha n$, then $w'$ witnesses that the first coordinate of the OLS regressor can be zeroed out by downweighting at most $\alpha n + d + 1$ samples.
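The reformulation above says that a feasible λ must make the weighted residual orthogonal to every column of X, including the first. A small numerical sketch (our own illustration) shows why removing weight is generically necessary: with w = 1 and λ the plain regression of y on the remaining covariates, the last d - 1 coordinates of the constraint vanish automatically, while the first coordinate is exactly the quantity that dropped weight must zero out.

```python
import numpy as np

def constraint_residual(X, y, w, lam):
    # The vector X^T (w * (X_tilde @ lam - y)) from Equation (1),
    # where X_tilde drops the first column of X (since beta_1 = 0).
    r = w * (X[:, 1:] @ lam - y)
    return X.T @ r

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.3 * rng.normal(size=200)  # response driven by first covariate

w = np.ones(200)
lam = np.linalg.lstsq(X[:, 1:], y, rcond=None)[0]  # OLS of y on X_tilde
res = constraint_residual(X, y, w, lam)
# res[1:] ~ 0 (normal equations of the reduced regression);
# res[0] is generically far from 0, so w = 1 is infeasible in Equation (1).
```

Finding a weight vector w that also drives the first coordinate to zero, while removing as little total weight as possible, is precisely the optimization problem (1).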

4. OVERVIEW OF ALGORITHMS

An exact algorithm. Our main tool for Theorem 1.2 is the following special case of an important result due to Renegar (1992) on solving quantified polynomial systems of inequalities:

Theorem 4.1 (Renegar (1992)). Given an expression $\forall x \in \mathbb{R}^{n_1} : \exists y \in \mathbb{R}^{n_2} : P(x, y)$, where $P(x, y)$ is a system of m polynomial inequalities with maximum degree d, the truth value of the expression can be decided in time $(md)^{O(n_1 n_2)}$.

Roughly, for a constant number of quantifier alternations, a quantified polynomial system can be decided in time exponential in the number of variables. Unfortunately, a naive formulation of the expression $\mathrm{Stability}(X, y) \le k$, by direct evaluation of Equation (1), has $n + d - 1$ variables:
\[
\exists \lambda \in \mathbb{R}^{d-1},\ w \in [0,1]^n : \sum_{i=1}^n w_i \ge n - k \text{ and } X^T(w \star (\tilde{X}\lambda - y)) = 0.
\]
Intuitively, it may not be necessary to search over all $w \in [0,1]^n$; for fixed $\lambda$, the maximum-weight w is described by a simple linear program. Formally, the linear program can be rewritten (Lemma B.1) by the separating hyperplane theorem, so that the overall expression becomes:
\[
\exists \lambda \in \mathbb{R}^{d-1} : \forall u \in \mathbb{R}^d : \exists w \in [0,1]^n : \|w\|_1 \ge n - k \text{ and } \sum_{i=1}^n w_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0. \tag{2}
\]
Now, for fixed $\lambda$ and u, the maximum-weight w has a very simple description: it only depends on the relative ordering of the n summands $(\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle$. By classical results on connected components of varieties, since the summands have only $2d - 1$ variables, the number of achievable orderings is only $n^{O(d)}$ rather than $n!$, and the orderings can be enumerated efficiently (Milnor, 1964; Renegar, 1992). This allows the quantifier over $w \in [0,1]^n$ to be replaced by a quantifier over the $n^{O(d)}$ achievable orderings, after which Theorem 4.1 implies that the overall expression can be decided in time $n^{O(d^3)}$. See Appendix B for details.

Approximation via partitioning. Next, we show how to avoid the heavy algorithmic machinery used in the previous result.
For Theorem 1.4, the strategy of PARTITIONANDAPPROX is to partition the OLS solution space $\mathbb{R}^{d-1}$ into roughly $n^d$ regions, such that if we restrict $\lambda$ to any one region, the bilinear program which defines the stability can be approximated by a linear program. Concretely, we start by rewriting formulation (1) as
\[
n - \mathrm{Stability}(X, y) = \sup_{w \in [0,1]^n,\ \lambda \in \mathbb{R}^{d-1}} \left\{ \sum_{i \in [n]} w_i : X^T(w \star (\tilde{X}\lambda - y)) = 0 \right\}. \tag{3}
\]
This has a nonlinear (and nonconvex) constraint due to the pointwise product between w and the residual vector $\tilde{X}\lambda - y$. Thus, we introduce the change of variables $g_i = w_i(\langle \tilde{X}_i, \lambda\rangle - y_i)$ for $i \in [n]$. This causes two issues. First, the constraint $0 \le w_i \le 1$ becomes $0 \le g_i/(\langle \tilde{X}_i, \lambda\rangle - y_i) \le 1$, which is no longer linear. To fix this, suppose that instead of maximizing over all $\lambda \in \mathbb{R}^{d-1}$, we maximize over a region $R \subseteq \mathbb{R}^{d-1}$ where each residual $\langle \tilde{X}_i, \lambda\rangle - y_i$ has constant sign $\sigma_i$. The constraint $0 \le w_i \le 1$ then becomes one of two linear constraints, depending on $\sigma_i$. Let $V_R$ denote the value of Program (3) restricted to $\lambda \in R$. Then with the change of variables, we have that
\[
V_R = \sup_{g \in \mathbb{R}^n,\ \lambda \in R} \left\{ \sum_{i \in [n]} \frac{g_i}{\langle \tilde{X}_i, \lambda\rangle - y_i} \ :\ X^T g = 0;\quad 0 \le g_i \le \langle \tilde{X}_i, \lambda\rangle - y_i\ \ \forall i \in [n] : \sigma_i = 1;\quad \langle \tilde{X}_i, \lambda\rangle - y_i \le g_i \le 0\ \ \forall i \in [n] : \sigma_i = -1 \right\},
\]
with the convention that $0/0 = 1$. Now the constraints are linear. Unfortunately (and this is the second issue), the objective is no longer linear. The solution is to partition the region R further: if the region were small enough that every residual $\langle \tilde{X}_i, \lambda\rangle - y_i$ had at most $(1 \pm \epsilon)$-multiplicative variation, then the objective could be approximated to within $1 \pm \epsilon$ by a linear objective. How many regions do we need? Let $M = \|X\beta^{(0)} - y\|_2$ be the unweighted OLS error. If all the residuals were bounded between $\delta M/\sqrt{n}$ and M in magnitude, for all $\lambda \in \mathbb{R}^{d-1}$, then the regions could be demarcated by $O(n \log_{1+\epsilon}(n/\delta))$ hyperplanes, for a total of $O(n \log_{1+\epsilon}(n/\delta))^d$ regions.
Of course, for some $\lambda$, some residuals may be very small or very large. But $(\epsilon, \delta)$-anti-concentration implies that for every $\lambda$, at most $\epsilon n$ residuals are very small, and it can be shown that if $\lambda$ is a weighted OLS solution, the total weight on samples with large residuals is low. Thus, for any region, we can exclude from the objective function the samples with residuals that are not well-approximated within the region, and this only affects the objective by $O(\epsilon n)$. This gives an algorithm with time complexity $(n\epsilon^{-1}\log(1/\delta))^{d+O(1)}$. To achieve the time complexity in Theorem 1.4, where the $\log(n/\delta)$ is additive rather than multiplicative, we use subsampling. Every residual is still partitioned by sign, but we multiplicatively partition only a random $\tilde{O}(d/\epsilon)$-size subset of the residuals. Intuitively, most residuals will still be well-approximated in any given region. This can roughly be formalized via a VC dimension argument, albeit with some technical complications. See Appendix D for details and Appendix J for formal pseudocode of the algorithm.

Net-based approximation. The algorithm NETAPPROX for Theorem 1.5 is intuitively the simplest. For any fixed $\lambda \in \mathbb{R}^{d-1}$, Program (1) reduces to a linear program with value denoted $S(\lambda)$. Thus, an obvious approach is to construct a net $N \subseteq \mathbb{R}^{d-1}$ in some appropriate metric, and compute $\min_{\lambda \in N} S(\lambda)$. This always upper bounds the stability, but to prove that it's an approximate lower bound, we need $S(\lambda)$ to be Lipschitz under the metric. The right metric turns out to be
\[
d(\lambda, \lambda') = \left\| \frac{\tilde{X}\lambda - y}{\|\tilde{X}\lambda - y\|_2} - \frac{\tilde{X}\lambda' - y}{\|\tilde{X}\lambda' - y\|_2} \right\|_2.
\]
Under this metric, $\mathbb{R}^{d-1}$ essentially embeds into a d-dimensional subspace of the Euclidean sphere $S^{n-1}$, and therefore has a $\gamma$-net of size $O(1/\gamma)^d$. Why is $S(\lambda)$ Lipschitz under d? First, if $\tilde{X}\lambda - y$ equals $\tilde{X}\lambda' - y$ up to rescaling, then it can be seen from Program (3) that $S(\lambda) = S(\lambda')$.
More generally, if the residuals are close up to rescaling, we apply the dual formulation of $S(\lambda)$ from expression (2):
\[
n - S(\lambda) = \inf_{u \in \mathbb{R}^d}\ \sup_{w \in [0,1]^n} \left\{ \|w\|_1 : \sum_{i=1}^n w_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0 \right\}.
\]
For any u, the optimal w for $\lambda$ and u can be rounded to some feasible $w'$ for $\lambda'$ and u without decreasing the $\ell_1$ norm too much, under strong anti-concentration. This shows that $S(\lambda)$ and $S(\lambda')$ are close. See Appendix E for details and Appendix J for formal pseudocode of NETAPPROX.
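The inner step of NETAPPROX, computing $S(\lambda)$ for a fixed λ, is a plain linear program in w. Below is a minimal sketch of that single step using scipy (our own illustration, not the pseudocode from Appendix J):

```python
import numpy as np
from scipy.optimize import linprog

def S(X, y, lam):
    # For fixed lam, Program (1) is linear in w:
    #   maximize sum(w)  s.t.  X^T (w * (X_tilde @ lam - y)) = 0,  0 <= w <= 1.
    # Returns n - max sum(w), which upper bounds Stability(X, y).
    n, d = X.shape
    r = X[:, 1:] @ lam - y
    A_eq = (X * r[:, None]).T   # row j: column j of X, weighted by residuals
    out = linprog(-np.ones(n), A_eq=A_eq, b_eq=np.zeros(d),
                  bounds=(0, 1), method="highs")
    return n + out.fun          # out.fun = -(max sum of weights)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
lam = np.array([0.4])
y_exact = X[:, 1:] @ lam        # residuals vanish at lam, so S should be 0
y_noisy = y_exact + X[:, 0] + 0.5 * rng.normal(size=100)
```

Minimizing S over a net of λ values yields the net upper bound used in the experiments; the Lipschitz argument above is what turns this into an approximate lower bound as well.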

5. EXPERIMENTS

In this section, we apply (modifications of) NETAPPROX and PARTITIONANDAPPROX to several datasets in two and three dimensions. Due to space constraints, we defer detailed discussion of the algorithmic modifications to Appendix I.1; we simply note that the modifications are made to improve practical efficiency and usability. Most saliently, the modified algorithms do not rely on Assumptions A and B: the modified NETAPPROX provides an unconditional upper bound on stability (referred to henceforth as "net upper bound"), and the modified PARTITIONANDAPPROX provides an unconditional lower bound ("LP lower bound"). As a result, we are able to experimentally verify that our algorithms provide tight (and unconditional) bounds on stability for a variety of datasets. As a baseline upper bound, we implement the greedy heuristic of Kuschnig et al. (2021) which refines Broderick et al. (2020) . We are not aware of any prior work on lower bounding stability, so we implement a simplification of our full lower bound algorithm as a baseline. See Appendix I for implementation details, hyperparameter choices, and discussion of error bars.

5.1. SYNTHETIC DATA

Heterogeneous data. We start with a simple two-dimensional dataset with two disparate subpopulations, where the greedy baseline fails to estimate the stability but our algorithms give tight estimates. For parameters n, k, and $\sigma$, we generate k independent samples $(X_i, y_i)$, where $X_i \in \mathbb{R}^2$ has independent coordinates $X_{i1} \sim N(-1, 0.01)$ and $X_{i2} \sim N(0, 1)$, and $y_i = X_{i1}$. Then, we generate $n - k$ independent samples $(X_i, y_i)$ where $X_{i1} = 0$ and $X_{i2} \sim N(0, 1)$, and $y_i \sim N(0, 1)$. It always suffices to remove the first subpopulation, so the stability is at most k. However, the first subpopulation has small individual influences, because the OLS regressor on the whole dataset can nearly interpolate the first subpopulation. Thus, we expect that the greedy algorithm will fail to notice the first subpopulation, and therefore remove far more than k samples. Indeed, this is what happens. For $n = 1000$ and k varying from 10 to 500, we compare our net upper bound and LP lower bound with the baselines. As seen in Figure 1, our methods are always better.

Covariance shift. In the previous example, removing k samples caused a pathological change in the sample covariance; it became singular. However, even modest, constant-factor instability in the sample covariance can cause the greedy algorithm to fail; see Appendix I.5 for details.

Isotropic Gaussian data. Instability can arise even in homogeneous data, as a result of a low signal-to-noise ratio (Broderick et al., 2020). But when the noise level is low, can we certify stability? For a broad range of noise levels, we experimentally show that this is the case. Specifically, for $d \in \{2, 3\}$ and noise parameter $\sigma$ ranging from 0.1 to 10, we generate n independent samples $(X_i, y_i)_{i=1}^n$ where $X_i \sim N(0, I_d)$ and $y_i = \langle X_i, \mathbf{1}\rangle + N(0, \sigma^2)$.
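The heterogeneous construction is easy to reproduce. The sketch below uses our own seed and variable names; note that $N(-1, 0.01)$ denotes variance 0.01, i.e. standard deviation 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 200

# Subpopulation 1: nearly interpolated by the full-data regressor.
Xa = np.column_stack([rng.normal(-1, 0.1, k), rng.normal(0, 1, k)])
ya = Xa[:, 0]
# Subpopulation 2: first covariate identically zero, pure-noise response.
Xb = np.column_stack([np.zeros(n - k), rng.normal(0, 1, n - k)])
yb = rng.normal(0, 1, n - k)

X = np.vstack([Xa, Xb])
y = np.concatenate([ya, yb])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# The first subpopulation's residuals are tiny relative to the second's,
# so its individual influences are small -- yet removing those k samples
# makes the first column identically zero and kills the coefficient.
res = np.abs(X @ beta - y)
```

This is exactly the failure mode of the local approximation: the samples that jointly determine the first coefficient are individually nearly uninfluential.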
For d = 2 and n = 1000 (Figure 2a ), our LP lower bound is nearly tight with the upper bounds, particularly as the noise level increases (in comparison, the baseline lower bound quickly degenerates towards zero). For d = 3 and n = 500 (Figure 2b ), the bounds are looser for small noise levels but still always within a small constant factor.

5.2. BOSTON HOUSING DATASET

The Boston Housing dataset (Harrison Jr & Rubinfeld, 1978; Gilley et al., 1996) consists of data from 506 census tracts of Greater Boston in the 1970 Census. There are 14 real-valued features, one of which-the median house value in USD 1000s-we designate as the response. Unfortunately the entire set of features is too large for our algorithms, so for our experiments we pick various subsets of two or three features to use as covariates. A Tale of Two Datasets. We exemplify our results with two particular feature subsets. First, we investigate the effect of zn (percentage of residential land zoned for large lots) on house values, controlling for rm (average number of rooms per home) and rad (highway accessibility index) but no bias term. On the entire dataset, we find a modest positive effect: the estimated coefficient of zn is roughly 0.06. Both the greedy heuristic and our net algorithm find subsets of just 8% of the data (38-40 samples) which, if removed, would nullify the effect. But is this tight, or could there be a much smaller subset with the same effect? Our LP lower bound certifies that removing at least 22.4 samples is necessary. Second, we investigate the effect of zn on house values, this time controlling only for crim (per capita crime rate). Our net algorithm finds a subset of just 27% of the data which was driving the effect, and the LP lower bound certifies that the stability is at least 8%. But this time, the greedy algorithm removes 90% of the samples, a clear failure. What happened? Plotting zn against crim reveals a striking heterogeneity in the data: 73% of the samples have zn = 0, and the remaining 27% of the samples (precisely those removed by the net algorithm) have crim < 0.83, i.e. very low crime rates. As in the synthetic example, this heterogeneity explains the greedy algorithm's failure. 
But heterogeneity is very common in real data: in this case, it's between the city proper and the suburbs, and in fact the OLS regressors of these two subpopulations on all 13 features are markedly different (Appendix I.6). Thus, it's important to have algorithms with provable guarantees for detecting when heterogeneity causes (or doesn't cause) unstable conclusions. All-feature-pairs analysis. To be thorough, we also apply our algorithms to all 156 ordered pairs of features. For each pair, we regress the response (i.e. median house value) against the two features by Ordinary Least Squares, and we use our algorithms on this 2-dimensional dataset to estimate how many samples need to be removed to nullify the effect of the first feature on the response. We also compare to the greedy upper bound. See Figure 3 for a perspective on the results. In each figure, each point corresponds to the results of one dataset. The left figure plots the net upper bound against the greedy upper bound: we can see that our net algorithm substantially outperforms the greedy heuristic on some datasets (i.e. finds a much smaller upper bound) and never performs much worse. The right figure plots the LP lower bound against the net upper bound (along with the line y = x). For a majority of the datasets, the upper bound and lower bound are close. Concretely, for 116 of the 156 datasets, we certifiably estimate the stability up to a factor of two -some are sensitive to removing less than 10 samples, and some are stable to removing even a majority of the samples.

6. CONCLUSIONS

In this work, we studied efficient estimation of the stability of OLS regressions to removing subsets of the training data. We showed that in low dimensions the problem is both theoretically and experimentally tractable, whereas in high dimensions exact computation of the stability likely requires exponential time. However, this is only the beginning of the story. Most immediately, since our lower bound algorithm takes time $n^{\Omega(d)}$, our experiments were limited to no more than three dimensions. Certifying stability of OLS regressions from, e.g., recent econometric studies may require additional heuristics or insights (e.g. developing a fixed-parameter tractable lower bound algorithm). Beyond that, identifying reasonable assumptions under which exponential dependence on the dimension can be entirely circumvented is another valuable direction for future work. Of course, machine learning extends far beyond linear regression, and as models grow more complex and opaque, stability certification becomes all the more crucial as a tool for enhancing trustworthiness. Certainly, OLS is important in its own right, but inasmuch as it is a key building block in more complex machine learning systems (from regression trees (Loh, 2011) to generative adversarial networks (Mao et al., 2017) and policy iteration in linear MDPs (Lagoudakis & Parr, 2003)), our work on estimating the stability of OLS is also a first step towards estimating stability for these systems. Finally, we remark that care must be taken when interpreting stability in practice. Large stability may increase trust in a model's parameters or predictions, but it does not mean that conclusions drawn from the model are "correct." Conversely, even if the stability is small, the conclusions may still be useful, with the caveat that they may be driven by a small sub-population. Whether this heterogeneity is problematic is context-dependent, and is a separate but important issue.

A FURTHER RELATED WORK

Local and global sensitivity metrics. Post-hoc evaluation of the sensitivity of a statistical inference to various types of model misspecification has long been recognized as an important research direction. Within this area, there is a distinction between local sensitivity metrics, which measure the sensitivity of the inference to infinitesimal misspecifications of the assumed model $M_0$ (e.g. Polasek (1984); Castillo et al. (2004); Belsley et al. (1980)), and global sensitivity metrics, which measure the set of possible inferences as the model ranges in some fixed set $M$ around $M_0$ (e.g. Leamer (1984); Tanaka et al. (1989); Černý et al. (2013)). For OLS in particular, there is a well-established literature on the influences of individual data points (Cook, 1977; Chatterjee & Hadi, 1986), which falls under local sensitivity analysis, since deleting a single data point is an infinitesimal perturbation to a dataset of size $n$ as $n \to \infty$. In contrast, identifying jointly influential subsets of the data (the "global" analogue) has been a long-standing challenge due to computational issues (see e.g. page 274 of Belsley et al. (1980)). Existing approaches typically focus on identifying outliers in a generic sense rather than with respect to a specific inference (Hadi & Simonoff, 1993), or study computationally tractable variations of deletion (e.g. constant-factor reweighting (Leamer, 1984)).

Robustified estimators. Ever since the work of Tukey and Huber, one of the central areas of statistics has been robustifying statistical estimators to be resilient to outliers (see, e.g., Huber (2004)). While a valuable branch of research, we view robust statistics as incomparable if not orthogonal to post-hoc sensitivity evaluation, for three reasons. First, samples that drive the conclusion (in the sense that deleting them would nullify the conclusion) are not synonymous with outliers: removing an outlier that works against the conclusion only makes the conclusion stronger. Indeed, outlier-trimmed datasets are not necessarily finite-sample robust (Broderick et al., 2020). Rather, finite-sample stability (along with the s-value (Gupta & Rothenhäusler, 2021)), in the regime where a constant fraction of samples is removed, may be thought of as a measure of resilience to heterogeneity and distribution shift. Second, it is unreasonable to argue that using robustified estimators obviates the need for sensitivity evaluation. Robust statistics has seen a recent algorithmic revival, with the development of computationally efficient estimators, for problems such as linear regression, that are robust in the strong contamination model (e.g. Klivans et al. (2018); Diakonikolas et al. (2019); Bakshi & Prasad (2021)). However, even positing that the strong contamination model is correct, estimation guarantees for these algorithms require strong, unverifiable (and unavoidable (Klivans et al., 2018)) assumptions about the uncorrupted data, such as hypercontractivity. Sensitivity analyses should support modeling assumptions, not depend upon them. Third, and perhaps most salient, classical estimators such as OLS are ubiquitous in practice, despite the existence of robust estimators. This alone justifies sensitivity analysis of the resulting scientific conclusions.

Distributionally robust optimization. A recent line of work in machine learning (Sinha et al., 2017; Duchi & Namkoong, 2018; Cauchois et al., 2020; Jeong & Namkoong, 2020) suggests that the lack of resilience of Empirical Risk Minimization to distribution shift can be mitigated by minimizing the supremum of risks with respect to distributions near the empirical training distribution (under e.g. Wasserstein distance or an f-divergence). Again, this approach of robustifying the estimator is valuable but incomparable to sensitivity analysis.
B PROOF OF THEOREM 1.2

In this section, we show how to exactly compute the stability of a $d$-dimensional dataset in time $n^{O(d^3)}$, proving Theorem 1.2. Our main tool is Theorem 4.1, a special case of an important result due to Renegar (1992) on solving quantified polynomial systems of inequalities. The expression $\mathrm{Stability}(X,y) \le k$ can indeed be written as a polynomial system of (degree-2) equations, with only an $\exists$ quantifier. Unfortunately, the number of variables in this naive formulation is $n + d - 1$ ($n$ for the weights and $d-1$ for the regressor), which yields an algorithm exponential in $n$. Thus, to take advantage of the above theorem, we need to reformulate the expression with fewer variables. The following lemma rewrites the stability, via the separation theorem for convex sets, in a form where the variable reduction will become apparent.

Lemma B.1. For any $(X_i, y_i)_{i=1}^n$ and $k \ge 0$, it holds that $\mathrm{Stability}(X,y) \le k$ if and only if
$$\exists \lambda \in \mathbb{R}^{d-1} :\ \forall u \in \mathbb{R}^d :\ \exists w \in [0,1]^n :\quad \|w\|_1 \ge n-k \ \wedge\ \sum_{i=1}^n w_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0, \tag{4}$$
where $\tilde{X} : n \times (d-1)$ is the matrix with columns $(X^T)_2, \dots, (X^T)_d$.

Proof. From formulation (1) of the stability, we know that $\mathrm{Stability}(X,y) \le k$ if and only if
$$\exists \lambda \in \mathbb{R}^{d-1} :\ \exists w \in [0,1]^n :\quad \|w\|_1 \ge n-k \ \wedge\ X^T(w \star (\tilde{X}\lambda - y)) = 0.$$
Fix $\lambda \in \mathbb{R}^{d-1}$. Define the set
$$D(n-k) = \left\{ w \star (\tilde{X}\lambda - y) \ :\ w \in [0,1]^n \ \wedge\ \|w\|_1 \ge n-k \right\}.$$
We are interested in the predicate $D(n-k) \cap \ker(X^T) \ne \emptyset$, or equivalently $0 \in D(n-k) + \ker(X^T)$. Observe that $D(n-k)$ is convex, since $w$ ranges over a convex set. Thus, by the separation theorem for a point and a convex set, $0 \in D(n-k) + \ker(X^T)$ if and only if for every $v \in \mathbb{R}^n$ we have $\sup_{x \in D(n-k)+\ker(X^T)} \langle v, x\rangle \ge 0$. If $v$ is not orthogonal to $\ker(X^T)$, then the inner product can be made arbitrarily large. Thus, it suffices to restrict to $v \in \mathrm{span}(X^T)$, in which case the supremum is simply over $x \in D(n-k)$. That is, writing $v = Xu$, we have $D(n-k) \cap \ker(X^T) \ne \emptyset$ if and only if
$$\forall u \in \mathbb{R}^d :\ \exists w \in [0,1]^n :\quad \|w\|_1 \ge n-k \ \wedge\ \langle Xu,\ w \star (\tilde{X}\lambda - y)\rangle \ge 0.$$
Quantifying over $\lambda$, we get the claimed expression.

The expression in Lemma B.1 still has $O(n)$ variables. However, we can now actually eliminate the variable $w$ at the cost of increasing the number of equations. This is because the optimal $w$ for fixed $\lambda$ and $u$ only depends on the relative order of the terms $(\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle$. We make the following definition:

Definition B.2. For any $\lambda \in \mathbb{R}^{d-1}$ and $u \in \mathbb{R}^d$, let $\pi(\lambda, u)$ be the unique permutation on $[n]$ such that for all $1 \le i \le n-1$,
$$(\langle \tilde{X}_{\pi_i}, \lambda\rangle - y_{\pi_i})\langle X_{\pi_i}, u\rangle \ \ge\ (\langle \tilde{X}_{\pi_{i+1}}, \lambda\rangle - y_{\pi_{i+1}})\langle X_{\pi_{i+1}}, u\rangle,$$
and such that equality implies $\pi_i < \pi_{i+1}$. Let $\Pi = \{\pi(\lambda, u) : \lambda \in \mathbb{R}^{d-1},\ u \in \mathbb{R}^d\}$.

Then it can be seen that for fixed $\lambda$ and $u$, the optimal choice of $w$ has coefficient 1 on $\pi(\lambda,u)_1, \dots, \pi(\lambda,u)_{\lfloor n-k\rfloor}$ and coefficient $n-k-\lfloor n-k\rfloor$ on $\pi(\lambda,u)_{\lfloor n-k\rfloor+1}$: if there is any feasible $w$ which makes the sum non-negative, then this choice of $w$ makes the sum non-negative as well. Denoting this vector by $w(\pi(\lambda,u))$, we have that in Equation 4 it suffices to restrict to $w \in \{w(\pi) : \pi \in \Pi\}$. A priori, the number of achievable permutations could be $n!$, in which case we would not have gained anything. However, because $\pi(\lambda, u)$ is defined by low-degree polynomials in only $2d-1$ variables, we can actually show that $|\Pi|$ is at most exponential in $d$, using the following result:

Theorem B.3 (Sign Partitions (Milnor, 1964; Renegar, 1992)). Let $g_1, \dots, g_m : \mathbb{R}^n \to \mathbb{R}$ be arbitrary polynomials, each with total degree at most $d$. Let $\mathrm{SG}(g)$ be the set of vectors $\sigma \in \{-1,0,1\}^m$ such that $\sigma$ is an achievable sign vector, i.e. there exists some $x \in \mathbb{R}^n$ with $\mathrm{sign}(g_i(x)) = \sigma_i$ for all $i \in [m]$. Then $|\mathrm{SG}(g)| \le (md)^{O(n)}$. Moreover, $\mathrm{SG}(g)$ can be enumerated in time $(md)^{O(n)}$.

Putting everything together, we have the following theorem, which proves Theorem 1.2.

Theorem B.4. For any permutation $\pi$ on $[n]$, define $w(\pi) \in [0,1]^n$ by
$$w(\pi)_{\pi_i} = \begin{cases} 1 & \text{if } i \le \lfloor n-k \rfloor, \\ n-k-\lfloor n-k\rfloor & \text{if } i = \lfloor n-k\rfloor + 1, \\ 0 & \text{otherwise.} \end{cases}$$
Then for any $k \in [0,n]$, it holds that $\mathrm{Stability}(X,y) > k$ if and only if
$$\forall \lambda \in \mathbb{R}^{d-1} :\ \exists u \in \mathbb{R}^d :\ \forall \pi \in \Pi :\quad \sum_{i=1}^n w(\pi)_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle < 0. \tag{5}$$
Moreover, $\Pi$ can be enumerated in time $n^{O(d)}$. Thus, the expression $\mathrm{Stability}(X,y) > k$ can be decided in time $n^{O(d^3)}$.

Proof. Fix $\lambda \in \mathbb{R}^{d-1}$ and $u \in \mathbb{R}^d$. If
$$\exists \pi \in \Pi :\quad \sum_{i=1}^n w(\pi)_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0, \tag{6}$$
then because $\|w(\pi)\|_1 \ge n-k$, we obviously get
$$\exists w \in [0,1]^n :\quad \|w\|_1 \ge n-k \ \wedge\ \sum_{i=1}^n w_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0. \tag{7}$$
Conversely, if (6) is false, then in particular $w(\pi(\lambda,u))$ produces a negative sum $\sum_{i=1}^n w(\pi(\lambda,u))_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle$. But by construction, $w(\pi(\lambda,u))$ maximizes this sum over all $w \in [0,1]^n$ with $\|w\|_1 = n-k$. Therefore, no weight vector with $\ell_1$ norm exactly $n-k$ produces a non-negative sum, and increasing the norm cannot help. Thus, (7) and (6) are equivalent. Quantifying over $\lambda$ and $u$, we have $\mathrm{Stability}(X,y) \le k$ if and only if
$$\exists \lambda \in \mathbb{R}^{d-1} :\ \forall u \in \mathbb{R}^d :\ \exists \pi \in \Pi :\quad \sum_{i=1}^n w(\pi)_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0.$$
Taking the negation yields expression (5). If we can compute $\Pi$, then this expression is a $\forall\exists$-system of polynomial inequalities with $2d-1$ variables and $|\Pi|$ degree-2 inequalities, so by Theorem 4.1 it can be decided in time $|\Pi|^{O(d^2)}$. It remains to show that $\Pi$ can be enumerated in time $n^{O(d)}$ (which also bounds $|\Pi|$). For any $i, j \in [n]$ with $i < j$, define the polynomial
$$f_{i,j}(\lambda, u) = (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle - (\langle \tilde{X}_j, \lambda\rangle - y_j)\langle X_j, u\rangle.$$
For any $\lambda$ and $u$, the permutation $\pi(\lambda, u)$ is determined by the signs of the polynomials $\{f_{i,j}\}_{i<j}$ at $(\lambda, u)$. But by Theorem B.3, the set of sign vectors can be computed in time $n^{O(d)}$. So $\Pi$ can be found in time $n^{O(d)}$ as well.
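To make the construction of $w(\pi)$ in Theorem B.4 concrete, here is a small sketch (the function name and inputs are hypothetical, for illustration only): given an ordering $\pi$ and a real budget $k$, it places weight 1 on the first $\lfloor n-k\rfloor$ positions of $\pi$, the fractional remainder on the next position, and 0 elsewhere, so that $\|w(\pi)\|_1 = n-k$.

```python
import math

def w_of_pi(pi, k):
    """Build the weight vector w(pi) from Theorem B.4.

    pi : a permutation of 0..n-1 (indices sorted by decreasing term value)
    k  : real removal budget, 0 <= k <= n
    Weights: 1 on the first floor(n-k) entries of pi, the fractional
    remainder on the next entry, 0 elsewhere, so sum(w) == n - k."""
    n = len(pi)
    full = math.floor(n - k)
    w = [0.0] * n
    for rank, i in enumerate(pi):
        if rank < full:
            w[i] = 1.0
        elif rank == full:
            w[i] = (n - k) - full   # fractional remainder (0 if n-k is integral)
            break
    return w

w = w_of_pi([2, 0, 3, 1, 4], k=1.5)
print(w, sum(w))  # -> [1.0, 0.5, 1.0, 1.0, 0.0] 3.5
```

Five samples with budget 1.5 leave total weight 3.5, with the single fractional coordinate landing at position $\lfloor n-k\rfloor + 1$ of the ordering, as in the theorem.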

C PROOF OF THEOREM 1.3

In this section, we prove Theorem 1.3. That is, we show that exactly computing the stability requires $n^{\Omega(d)}$ time under the Exponential Time Hypothesis, by a simple reduction from the Maximum Feasible Subsystem problem. The latter problem is already known to take $n^{\Omega(d)}$ time in $d$ dimensions under the Exponential Time Hypothesis:

Theorem C.1 (Theorem 13 in Giannopoulos et al. (2009)). Suppose that there is an $n^{o(d)}$-time algorithm for the following problem: given $n$ vectors $v_1, \dots, v_n \in \mathbb{R}^d$, real numbers $c_1, \dots, c_n \in \mathbb{R}$, and an integer $0 \le k \le n$, determine whether $\max_{\lambda \in \mathbb{R}^d} \sum_{i=1}^n \mathbb{1}[\langle v_i, \lambda\rangle = c_i] \ge k$. Then the Exponential Time Hypothesis is false.

From this, it is easily seen that exactly computing the stability also requires $n^{\Omega(d)}$ time.

Theorem C.2. Suppose that there is an $n^{o(d)}$-time algorithm for the following problem: given $X_1, \dots, X_n \in \mathbb{R}^d$ and $y_1, \dots, y_n \in \mathbb{R}$, as well as an integer $0 \le k \le n$, determine whether $\mathrm{Stability}(X,y) \le n-k$. Then the Exponential Time Hypothesis is false.

Proof. We reduce from Maximum Feasible Subsystem. Given $v_1, \dots, v_n \in \mathbb{R}^d$ and $c_1, \dots, c_n \in \mathbb{R}$, define $X_i = (c_i, v_i) \in \mathbb{R}^{d+1}$ and $y_i = c_i$. Then the regressor $e_1 = (1, 0, \dots, 0) \in \mathbb{R}^{d+1}$ perfectly fits the dataset, i.e. $\sum_{i=1}^n (\langle X_i, e_1\rangle - y_i)^2 = 0$. Thus, for any $w \in [0,1]^n$,
$$\mathrm{OLS}(X, y, w) = \left\{ \beta \in \mathbb{R}^{d+1} : \sum_{i=1}^n w_i (\langle X_i, \beta\rangle - y_i)^2 = 0 \right\}.$$
Suppose that $\mathrm{Stability}(X,y) \le n-k$. Then there are some $w \in [0,1]^n$ and $\beta$ with $\beta_1 = 0$ such that $\|w\|_1 \ge k$ and $\sum_{i=1}^n w_i (\langle X_i, \beta\rangle - y_i)^2 = 0$. Let $S \subseteq [n]$ be the support of $w$; then $|S| \ge k$. Moreover, for every $i \in S$ it holds that $\langle X_i, \beta\rangle - y_i = 0$. Since $\beta_1 = 0$, by the definition of $X_i$ and $y_i$ this implies that $\langle v_i, \beta_{2:d+1}\rangle = c_i$. Thus, $\sum_{i=1}^n \mathbb{1}[\langle v_i, \beta_{2:d+1}\rangle = c_i] \ge k$.

Conversely, suppose that there exist some $\lambda \in \mathbb{R}^d$ and a set $S \subseteq [n]$ of size $k$ such that $\langle v_i, \lambda\rangle = c_i$ for all $i \in S$. Define $w = \mathbb{1}_S \in [0,1]^n$ and $\beta = (0, \lambda)$. Then it is clear that
$$\sum_{i=1}^n w_i (\langle X_i, \beta\rangle - y_i)^2 = \sum_{i \in S} (\langle v_i, \lambda\rangle - c_i)^2 = 0.$$
Thus, $\beta \in \mathrm{OLS}(X, y, w)$, so $\mathrm{Stability}(X,y) \le n-k$. This completes the reduction.

Remark C.3. Maximum Feasible Subsystem does admit an $\epsilon n$-additive approximation algorithm in time $\tilde{O}((d/\epsilon)^d)$, by subsampling. Thus, proving $n^{\Omega(d)}$-hardness for $\epsilon n$-approximation of stability would require a different technique.
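The lifting step of this reduction is easy to sketch in code. The instance below is a hypothetical Maximum Feasible Subsystem input, used only to illustrate that $e_1$ fits the lifted dataset with zero residual.

```python
import numpy as np

def lift_mfs_instance(v, c):
    """Map an MFS instance (v_i, c_i) to a stability instance:
    X_i = (c_i, v_i) in R^{d+1}, y_i = c_i. The regressor e_1 then fits
    (X, y) with zero residual, so asking whether the first coefficient
    can be zeroed while keeping k samples perfectly fit is exactly the
    Maximum Feasible Subsystem question."""
    X = np.column_stack([c, v])
    y = np.asarray(c, dtype=float)
    return X, y

# Hypothetical MFS instance in d = 2.
v = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = np.array([2.0, 3.0, -1.0])
X, y = lift_mfs_instance(v, c)
e1 = np.zeros(X.shape[1]); e1[0] = 1.0
print(np.sum((X @ e1 - y) ** 2))  # -> 0.0 (perfect fit)
```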

D PROOF OF THEOREM 1.4

In this section, we prove Theorem 1.4. The main idea of PARTITIONANDAPPROX is that under Assumption A, we can approximate the stability by partitioning $\mathbb{R}^{d-1}$ into roughly $n^d$ regions and solving a linear program on each region. See Appendix J for the complete algorithm.

D.1 PARTITIONING SCHEME

Given samples $(X_i, y_i)_{i=1}^n$, let $M = \|X\beta^{(0)} - y\|_2$, where $\beta^{(0)} \in \mathrm{OLS}(X, y, \mathbf{1})$. Let $S \subseteq [n]$ be a uniformly random size-$m$ subset of $[n]$. Let $R_1, \dots, R_p$ be the closed connected subsets of $\mathbb{R}^{d-1}$ cut out by the following set of equations $E$:
$$\begin{cases} \langle \tilde{X}_i, \lambda\rangle - y_i = 0 & \forall i \in [n] \\ \langle \tilde{X}_i, \lambda\rangle - y_i = \sigma M & \forall i \in [n],\ \forall \sigma \in \{-1,1\} \\ \langle \tilde{X}_i, \lambda\rangle - y_i = \sigma \delta M/\sqrt{n} & \forall i \in [n],\ \forall \sigma \in \{-1,1\} \\ \langle \tilde{X}_i, \lambda\rangle - y_i = \sigma (1+\epsilon)^k \delta M/\sqrt{n} & \forall i \in S,\ \forall \sigma \in \{-1,1\},\ \forall\, 0 \le k \le \lceil \log_{1+\epsilon}(\sqrt{n}/\delta) \rceil \end{cases}$$
Formally, we define a region for every feasible assignment of the equations to $\{=, <, >\}$, and then replace each strict inequality by a non-strict inequality, so that the region is closed.

First, we observe that the number of regions is not too large, and in fact we can enumerate the regions efficiently.

Lemma D.1. The regions $R_1, \dots, R_p$ cut out by $E$ can be enumerated in time $O(|E|)^{d+O(1)}$, and each region is described by at most $|E|$ linear constraints.

Second, we show that within each region, the residuals are pinned down up to a multiplicative factor of roughly $1+\epsilon$ on all but a small set of samples.

Lemma D.3. Let $\lambda^* \in \mathbb{R}^{d-1}$, and let $R$ be a region containing $\lambda^*$. Let $B_M$ be the set of $i \in [n]$ such that $|\langle \tilde{X}_i, \lambda^*\rangle - y_i| > M$, and let $B_{\delta M}$ be the set of $i \in [n]$ such that $|\langle \tilde{X}_i, \lambda^*\rangle - y_i| \le \delta M/\sqrt{n}$. If $m \ge \frac{C}{\epsilon}\left(d \log\frac{1}{\epsilon} + \log\frac{1}{\eta}\right)$ for an absolute constant $C$, then with probability at least $1-\eta$ over the choice of $S$, the following holds. For every $\lambda \in R$, the number of $i \in [n] \setminus (B_M \cup B_{\delta M})$ such that
$$\frac{\langle \tilde{X}_i, \lambda\rangle - y_i}{\langle \tilde{X}_i, \lambda^*\rangle - y_i} \notin [1-\epsilon,\ 1+\epsilon]$$
is at most $\epsilon n$.

Proof. Let $\sigma \in \{-1,0,1\}^n$ be defined by $\sigma_i = \mathrm{sign}(\langle \tilde{X}_i, \lambda^*\rangle - y_i)$. Note that by construction of the regions, $\mathrm{sign}(\langle \tilde{X}_i, \lambda\rangle - y_i) = \sigma_i$ for every $\lambda \in R$. Let $S' = S \cap ([n] \setminus (B_M \cup B_{\delta M}))$. If $n' := |[n] \setminus (B_M \cup B_{\delta M})| \le \epsilon n$, then the lemma statement is trivially true. Otherwise, by a Chernoff bound,
$$\Pr\left[ |S'| \ge \frac{n'}{2n} m \right] \ \ge\ 1 - \exp(-\Omega(n'm/n)) \ \ge\ 1 - \eta/3.$$

Condition on the event $|S'| \ge n'm/(2n)$. Then $S'$ is uniform over size-$|S'|$ subsets of $[n] \setminus (B_M \cup B_{\delta M})$. Define $\mathcal{X} = [n]$, and define a concept class $\mathcal{C} = \{h_\lambda : \mathcal{X} \to \{0,1\} \mid \lambda \in \mathbb{R}^{d-1}\}$ as the set of binary functions
$$h_\lambda(i) = \mathbb{1}\left[ \sigma_i \left( \langle \tilde{X}_i,\ \lambda - (1+\epsilon)\lambda^* \rangle + \epsilon y_i \right) \le 0 \right].$$
Let $D$ be the distribution on $\mathcal{X} \times \{0,1\}$ where $(i,s) \sim D$ has $i \sim \mathrm{Unif}([n] \setminus (B_M \cup B_{\delta M}))$ and $s = 1$. Then $S' \times \{1\}$ consists of at least $n'm/(2n)$ independent samples from $D$. For any $\lambda \in R$, we claim that the function $h_\lambda$ fits $S' \times \{1\}$ perfectly. Indeed, for any $i \in S'$, we know that $\delta M/\sqrt{n} \le \sigma_i(\langle \tilde{X}_i, \lambda^*\rangle - y_i) \le M$, since $i \notin B_M \cup B_{\delta M}$. So by construction of the regions, it holds that
$$\frac{1}{1+\epsilon} \ \le\ \frac{\sigma_i(\langle \tilde{X}_i, \lambda\rangle - y_i)}{\sigma_i(\langle \tilde{X}_i, \lambda^*\rangle - y_i)} \ \le\ 1+\epsilon.$$
From the right-hand inequality, we get precisely that $h_\lambda(i) = 1$, as claimed. By Lemma D.4, we have $\mathrm{vc}(\mathcal{C}) \le d$. Thus, since
$$|S'| \ \ge\ \frac{n'm}{2n} \ \ge\ \frac{Cd}{(2n/n')\epsilon} \log\frac{1}{\epsilon} + \frac{C}{(2n/n')\epsilon} \log\frac{1}{\eta},$$
if $C$ is a sufficiently large absolute constant, we can apply Theorem D.2 with failure parameter $\epsilon' := n\epsilon/(2n')$ to get that with probability at least $1-\eta/3$,
$$\sup_{\lambda \in R}\ \Pr_{(i,s) \sim D}[h_\lambda(i) \ne s] \ \le\ \epsilon'.$$
Equivalently, for every $\lambda \in R$, the number of $i \in [n] \setminus (B_M \cup B_{\delta M})$ such that $\sigma_i(\langle \tilde{X}_i, \lambda\rangle - y_i) > (1+\epsilon)\sigma_i(\langle \tilde{X}_i, \lambda^*\rangle - y_i)$ is at most $\epsilon' n' = \epsilon n/2$. An identical argument with the binary functions $g_\lambda(i) = \mathbb{1}[\sigma_i(\langle \tilde{X}_i,\ \lambda - (1-\epsilon)\lambda^*\rangle - \epsilon y_i) \ge 0]$ proves that with probability at least $1-\eta/3$, for every $\lambda \in R$, the number of $i \in [n] \setminus (B_M \cup B_{\delta M})$ such that $\sigma_i(\langle \tilde{X}_i, \lambda\rangle - y_i) < (1-\epsilon)\sigma_i(\langle \tilde{X}_i, \lambda^*\rangle - y_i)$ is at most $\epsilon n/2$. Overall, by the union bound (taking into account the event that $|S'| < n'm/(2n)$), with probability at least $1-\eta$ it holds that for every $\lambda \in R$, the number of $i \in [n] \setminus (B_M \cup B_{\delta M})$ such that either $\sigma_i(\langle \tilde{X}_i, \lambda\rangle - y_i) > (1+\epsilon)\sigma_i(\langle \tilde{X}_i, \lambda^*\rangle - y_i)$ or $\sigma_i(\langle \tilde{X}_i, \lambda\rangle - y_i) < (1-\epsilon)\sigma_i(\langle \tilde{X}_i, \lambda^*\rangle - y_i)$ is at most $\epsilon n$, as claimed.

It remains to bound the VC dimension of the function class.

Lemma D.4. For any $n, d > 0$ and any $(X_i, y_i)_{i \in [n]} \subset \mathbb{R}^d \times \mathbb{R}$, the VC dimension of the concept class $\mathcal{C} = \{h_\lambda : [n] \to \{0,1\} \mid \lambda \in \mathbb{R}^d\}$, where $h_\lambda(i) = \mathbb{1}[\langle X_i, \lambda\rangle + y_i \le 0]$, is at most $d+1$.

Proof. Note that extending the domain of the concepts cannot decrease the VC dimension. Thus, if we define $\mathcal{C}' = \{h'_\lambda : \mathbb{R}^d \times \mathbb{R} \to \{0,1\} \mid \lambda \in \mathbb{R}^d\}$ by $h'_\lambda(X, y) = \mathbb{1}[\langle X, \lambda\rangle + y \le 0]$, then $\mathrm{vc}(\mathcal{C}') \ge \mathrm{vc}(\mathcal{C})$. But all of the concepts in $\mathcal{C}'$ are affine halfspaces in $d+1$ dimensions, so $\mathrm{vc}(\mathcal{C}') \le d+1$.
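As a concrete illustration of the partitioning scheme in Section D.1, the following sketch (a hypothetical helper, not the paper's implementation) enumerates, per sample, the residual levels at which hyperplanes are placed: $0$, $\pm M$, $\pm\delta M/\sqrt{n}$, and, for the subsampled indices in $S$, the geometric grid $\pm(1+\epsilon)^k \delta M/\sqrt{n}$.

```python
import math

def residual_thresholds(M, delta, eps, n, subsampled):
    """For each sample index, return the sorted list of residual levels t
    at which a hyperplane <X~_i, lam> - y_i = t is placed. Every index
    gets {0, +-M, +-delta*M/sqrt(n)}; indices in `subsampled` (the random
    set S) additionally get the geometric grid +-(1+eps)^k * delta*M/sqrt(n)
    for 0 <= k <= ceil(log_{1+eps}(sqrt(n)/delta)), which reaches +-M."""
    base = delta * M / math.sqrt(n)
    kmax = math.ceil(math.log(math.sqrt(n) / delta) / math.log(1 + eps))
    levels = {}
    for i in range(n):
        ts = {0.0, M, -M, base, -base}
        if i in subsampled:
            for k in range(kmax + 1):
                t = (1 + eps) ** k * base
                ts.update({t, -t})
        levels[i] = sorted(ts)
    return levels

lv = residual_thresholds(M=10.0, delta=0.5, eps=0.5, n=100, subsampled={0})
print(len(lv[1]), len(lv[0]))  # -> 5 21 (plain index vs. subsampled index)
```

Only the $O(m \cdot \epsilon^{-1}\log(\sqrt{n}/\delta))$ geometric-grid hyperplanes come from the subsample, which is what keeps $|E|$ (and hence the region count) under control.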

D.2 ALGORITHM

For every region $R$ we can identify some arbitrary representative $\lambda_0(R) \in R$ and a sign pattern $\sigma \in \{-1,1\}^n$ such that $\sigma_i = 1$ implies $\langle \tilde{X}_i, \lambda\rangle - y_i \ge 0$ for all $\lambda \in R$, and $\sigma_i = -1$ implies $\langle \tilde{X}_i, \lambda\rangle - y_i \le 0$ for all $\lambda \in R$. Ideally, we would like to compute $V = \max_R V_R$, where
$$V_R := \sup_{g \in \mathbb{R}^n,\ \lambda \in R} \left\{ \sum_{i \in [n]} \frac{g_i}{\langle \tilde{X}_i, \lambda\rangle - y_i} \ :\ X^T g = 0,\quad 0 \le g_i \le \langle \tilde{X}_i, \lambda\rangle - y_i\ \ \forall i : \sigma_i = 1,\quad \langle \tilde{X}_i, \lambda\rangle - y_i \le g_i \le 0\ \ \forall i : \sigma_i = -1 \right\}. \tag{8}$$
However, this is not a linear program. Let $B_M = B_M(R) \subseteq [n]$ be the set of $i \in [n]$ such that $|\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i| > M$, and let $B_{\delta M} = B_{\delta M}(R) \subseteq [n]$ be the set of $i \in [n]$ such that $|\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i| < \delta M/\sqrt{n}$. For each region $R$ we compute
$$(\hat{g}(R), \hat{\lambda}(R)) := \arg\sup_{g \in \mathbb{R}^n,\ \lambda \in R} \left\{ \sum_{i \in [n] \setminus (B_M \cup B_{\delta M})} \min\left( \frac{g_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right) \ :\ X^T g = 0,\quad 0 \le g_i \le \langle \tilde{X}_i, \lambda\rangle - y_i\ \ \forall i : \sigma_i = 1,\quad \langle \tilde{X}_i, \lambda\rangle - y_i \le g_i \le 0\ \ \forall i : \sigma_i = -1 \right\} \tag{10}$$
and define $\hat{V}(R) = \sum_{i \in [n]} \frac{\hat{g}(R)_i}{\langle \tilde{X}_i, \hat{\lambda}(R)\rangle - y_i}$, with the convention that $0/0 = 1$. Note that although there is a min in the objective of Program 10, it is equivalent to a linear program by a standard transformation: we can introduce variables $s_1, \dots, s_n$ with the constraints $s_i \le 1$ and $s_i \le g_i/(\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i)$ (which are linear, since $\lambda_0(R)$ is fixed), and change the objective to maximize $\sum_i s_i$. That is, the following program also computes $(\hat{g}(R), \hat{\lambda}(R))$:
$$(\hat{g}(R), \hat{\lambda}(R)) = \arg\sup_{g \in \mathbb{R}^n,\ \lambda \in R,\ s \in \mathbb{R}^n} \left\{ \sum_{i \in [n] \setminus (B_M \cup B_{\delta M})} s_i \ :\ X^T g = 0,\quad 0 \le g_i \le \langle \tilde{X}_i, \lambda\rangle - y_i\ \ \forall i : \sigma_i = 1,\quad \langle \tilde{X}_i, \lambda\rangle - y_i \le g_i \le 0\ \ \forall i : \sigma_i = -1,\quad s_i \le 1,\ \ s_i \le \frac{g_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i}\ \ \forall i \in [n] \setminus (B_M \cup B_{\delta M}) \right\} \tag{12}$$

Lemma D.5. Let $(w^*, \lambda^*)$ be an optimal solution to Program 1. Suppose that the event of Lemma D.3 holds (with respect to $\lambda^*$). Then either $\max_R |B_{\delta M}(R)| > \epsilon n$, or
$$V - 12\epsilon n - 1 \ \le\ \max_R \hat{V}(R) \ \le\ V.$$

Proof. For the right-hand inequality, observe that for any region $R$, if we define $w \in \mathbb{R}^n$ by $w_i = \hat{g}(R)_i/(\langle \tilde{X}_i, \hat{\lambda}(R)\rangle - y_i)$, with the convention that $0/0 = 1$, then the second and third constraints of Program 10 ensure that $w \in [0,1]^n$. Additionally, the first constraint ensures that $X^T(w \star (\tilde{X}\hat{\lambda}(R) - y)) = 0$. Thus, $(w, \hat{\lambda}(R))$ is feasible for the original problem. This means that $\hat{V}(R) = \|w\|_1 \le V$.

To prove the lower bound, suppose that $\max_R |B_{\delta M}(R)| \le \epsilon n$. Consider the specific region $R$ containing the optimal parameter vector $\lambda^*$ (if there are multiple, choose any), and let $B_M = B_M(R)$ and $B_{\delta M} = B_{\delta M}(R)$. Define $g^* \in \mathbb{R}^n$ by $g^*_i = w^*_i(\langle \tilde{X}_i, \lambda^*\rangle - y_i)$. Let $B_{\mathrm{apx}} \subseteq [n] \setminus (B_M \cup B_{\delta M})$ be the set of $i \in [n] \setminus (B_M \cup B_{\delta M})$ such that
$$\frac{1}{1+\epsilon}(\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i) > \langle \tilde{X}_i, \lambda^*\rangle - y_i \quad\text{or}\quad \langle \tilde{X}_i, \lambda^*\rangle - y_i > (1+\epsilon)(\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i).$$
By Lemma D.3 (applied specifically to $\lambda = \lambda_0(R)$), we know $|B_{\mathrm{apx}}| \le \epsilon n$. By assumption, we know that $|B_{\delta M}| \le \epsilon n$. And we know that if $i \in B_M$, then $|\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i| \ge M$, so the same holds for all $\lambda \in R$ and in particular for $\lambda^*$. Thus,
$$\sum_{i \in B_M} w^*_i \ \le\ \frac{1}{M^2} \sum_{i \in B_M} w^*_i (\langle \tilde{X}_i, \lambda^*\rangle - y_i)^2 \ \le\ \frac{1}{M^2} \sum_{i \in [n]} w^*_i (\langle \tilde{X}_i, \lambda^*\rangle - y_i)^2 \ \le\ \frac{1}{M^2} \sum_{i \in [n]} w^*_i (\langle X_i, \beta^{(0)}\rangle - y_i)^2 \ \le\ \frac{1}{M^2} \|X\beta^{(0)} - y\|_2^2 \ =\ 1,$$
where the third inequality is because $\lambda^* \in \mathrm{OLS}(X, y, w^*)$, and the equality is by definition of $M$. Finally, for any $i \in [n] \setminus (B_M \cup B_{\delta M} \cup B_{\mathrm{apx}})$, we have
$$\min\left( \frac{g^*_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right) \ \ge\ \min\left( \frac{1}{1+\epsilon} \cdot \frac{g^*_i}{\langle \tilde{X}_i, \lambda^*\rangle - y_i},\ 1 \right) \ =\ \frac{1}{1+\epsilon} \cdot \frac{g^*_i}{\langle \tilde{X}_i, \lambda^*\rangle - y_i}.$$
Therefore,
$$V = \sum_{i \in [n]} w^*_i = \sum_{i \in [n] \setminus (B_M \cup B_{\delta M} \cup B_{\mathrm{apx}})} \frac{g^*_i}{\langle \tilde{X}_i, \lambda^*\rangle - y_i} + \sum_{i \in B_M} w^*_i + \sum_{i \in B_{\delta M}} w^*_i + \sum_{i \in B_{\mathrm{apx}}} w^*_i \ \le\ (1+\epsilon) \sum_{i \in [n] \setminus (B_M \cup B_{\delta M} \cup B_{\mathrm{apx}})} \min\left( \frac{g^*_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right) + 1 + 2\epsilon n \ \le\ (1+\epsilon) \sum_{i \in [n] \setminus (B_M \cup B_{\delta M})} \min\left( \frac{g^*_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right) + 1 + 2\epsilon n.$$
The sum in the last line above is precisely the objective of Program 10 at $(g^*, \lambda^*)$. Moreover, $(g^*, \lambda^*)$ is feasible for Program 10, because $X^T g^* = X^T(w^* \star (\tilde{X}\lambda^* - y)) = 0$ and $g^*_i/(\langle \tilde{X}_i, \lambda^*\rangle - y_i) = w^*_i \in [0,1]$ for all $i \in [n]$. Thus, the optimal solution $(\hat{g}(R), \hat{\lambda}(R))$ satisfies the inequality
$$\sum_{i \in [n] \setminus (B_M \cup B_{\delta M})} \min\left( \frac{g^*_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right) \ \le\ \sum_{i \in [n] \setminus (B_M \cup B_{\delta M})} \min\left( \frac{\hat{g}(R)_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right).$$
Finally, let $\tilde{B}_{\mathrm{apx}} \subseteq [n] \setminus (B_M \cup B_{\delta M})$ be the set of $i$ such that
$$\frac{1}{(1+\epsilon)^2}(\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i) > \langle \tilde{X}_i, \hat{\lambda}(R)\rangle - y_i \quad\text{or}\quad \langle \tilde{X}_i, \hat{\lambda}(R)\rangle - y_i > (1+\epsilon)^2(\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i).$$
By Lemma D.3, the residuals at $\lambda_0(R)$ multiplicatively approximate the residuals at $\lambda^*$ except for $\epsilon n$ samples, and the residuals at $\hat{\lambda}(R)$ also multiplicatively approximate the residuals at $\lambda^*$ except for $\epsilon n$ samples. Thus, $|\tilde{B}_{\mathrm{apx}}| \le 2\epsilon n$, so we have
$$\sum_{i \in [n] \setminus (B_M \cup B_{\delta M})} \min\left( \frac{\hat{g}(R)_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right) \ \le\ \sum_{i \in [n] \setminus (B_M \cup B_{\delta M} \cup \tilde{B}_{\mathrm{apx}})} \min\left( \frac{\hat{g}(R)_i}{\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i},\ 1 \right) + 2\epsilon n \ \le\ (1+\epsilon)^2 \sum_{i \in [n] \setminus (B_M \cup B_{\delta M} \cup \tilde{B}_{\mathrm{apx}})} \frac{\hat{g}(R)_i}{\langle \tilde{X}_i, \hat{\lambda}(R)\rangle - y_i} + 2\epsilon n \ \le\ (1+\epsilon)^2 \hat{V}(R) + 2\epsilon n.$$
We conclude that $V \le (1+\epsilon)^3 \hat{V}(R) + (1+\epsilon) \cdot 2\epsilon n + 1 + 2\epsilon n$. Since $\hat{V}(R) \le n$, simplifying gives $V \le \hat{V}(R) + 12\epsilon n + 1$, as claimed.

Using the above lemma in conjunction with Lemma D.3 immediately gives the desired theorem (of which Theorem 1.4 is a direct corollary).

Theorem D.6. For any $\epsilon, \delta, \eta > 0$, there is an algorithm PARTITIONANDAPPROX with time complexity
$$\left( n + \frac{Cd}{\epsilon^2} \log\frac{n}{\delta} \log\frac{1}{\epsilon\eta} \right)^{d+O(1)}$$
which, given $\epsilon$, $\delta$, $\eta$, and arbitrary samples $(X_i, y_i)_{i=1}^n$, either outputs $\bot$ or an estimate $\hat{S}$. If the output is $\bot$, then the samples do not satisfy $(\epsilon, \delta)$-anti-concentration (Assumption A). Moreover, the probability that the output is some $\hat{S}$ such that $|\hat{S} - \mathrm{Stability}(X,y)| > 12\epsilon n + 1$ is at most $\eta$.

Proof. The algorithm PARTITIONANDAPPROX does the following (see Appendix J for pseudocode). Let $m = C\epsilon^{-1}(d \log \epsilon^{-1} + \log \eta^{-1})$, where $C$ is the constant specified in Lemma D.3, and let $M = \|X\beta^{(0)} - y\|_2$, where $\beta^{(0)} \in \mathrm{OLS}(X, y, \mathbf{1})$. Let $E$ be the set of equations described in Section D.1, with respect to a uniformly random subset $S \subseteq [n]$ of size $m$. Let $R_1, \dots, R_p$ be the closed connected regions cut out by $E$. By Lemma D.1, we can enumerate $R_1, \dots, R_p$ in time $O(|E|)^{d+O(1)}$; each is described by at most $|E|$ linear constraints. For each $R$, we can find a representative $\lambda_0(R) \in R$ by solving a feasibility LP on $R$, and, by solving $n$ LPs on $R$, we can find a sign pattern $\sigma \in \{-1,1\}^n$ such that $\sigma_i = 1$ implies $\langle \tilde{X}_i, \lambda\rangle - y_i \ge 0$ for all $\lambda \in R$, and $\sigma_i = -1$ implies $\langle \tilde{X}_i, \lambda\rangle - y_i \le 0$ for all $\lambda \in R$. We also compute $B_M = \{i \in [n] : |\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i| > M\}$ and $B_{\delta M} = \{i \in [n] : |\langle \tilde{X}_i, \lambda_0(R)\rangle - y_i| < \delta M/\sqrt{n}\}$. If $|B_{\delta M}| > \epsilon n$, then return $\bot$. Otherwise, compute $(\hat{g}(R), \hat{\lambda}(R))$, the solution to Program 12, and compute $\hat{V}(R) = \sum_{i=1}^n \frac{\hat{g}(R)_i}{\langle \tilde{X}_i, \hat{\lambda}(R)\rangle - y_i}$. Finally, after iterating through all regions, output $\max_{i \in [p]} \hat{V}(R_i)$. Correctness follows from Lemmas D.3 and D.5. For each region, the time complexity is $\mathrm{poly}(|E|)$. As there are $O(|E|)^d$ regions, the overall time complexity is $O(|E|)^{d+O(1)}$. Observing that $|E| = O(n + m\epsilon^{-1}\log(n/\delta))$ completes the proof.

E PROOF OF THEOREM 1.5

In the previous section, we approximated the nonlinear program (1) by partitioning $\mathbb{R}^{d-1}$ into regions on which the program could be approximated by a linear program. This approach had the disadvantage of requiring that in each region, the signs of the residuals $\langle \tilde{X}_i, \lambda\rangle - y_i$ be constant (so that the program could be reparametrized to have linear constraints), which necessitates making $\Omega(n^d)$ regions. In this section, we instead make use of the fact that for fixed $\lambda$, Program 1 is a linear program and therefore efficiently solvable.
Our algorithm NETAPPROX is simply to (carefully) choose a finite subset $N \subset \mathbb{R}^{d-1}$, solve the linear program for each $\lambda \in N$, and pick the best answer. The following lemma describes how to compute $N$, which will be an $\epsilon$-net over $\mathbb{R}^{d-1}$ in an appropriate metric.

Lemma E.1. For any $(X_i, y_i)_{i=1}^n$ and $\gamma > 0$, there is a set $N \subseteq \mathbb{R}^{d-1}$ of size $|N| \le (2\sqrt{d}/\gamma)^d$ such that for any $\lambda \in \mathbb{R}^{d-1}$ with $\tilde{X}\lambda \ne y$, there are some $\lambda' \in N$ with $\tilde{X}\lambda' \ne y$ and some $\sigma \in \{-1,1\}$ such that
$$\left\| \frac{\tilde{X}\lambda - y}{\|\tilde{X}\lambda - y\|_2} - \sigma \frac{\tilde{X}\lambda' - y}{\|\tilde{X}\lambda' - y\|_2} \right\|_2 \ \le\ \gamma.$$
Moreover, $N$ can be computed in time $O(\sqrt{d}/\gamma)^d$.

For $\lambda \in \mathbb{R}^{d-1}$, define
$$V(\lambda) = \sup_{w \in [0,1]^n} \left\{ \|w\|_1 \ :\ X^T(w \star (\tilde{X}\lambda - y)) = 0 \right\},$$
so that $n - \mathrm{Stability}(X,y) = \sup_{\lambda \in \mathbb{R}^{d-1}} V(\lambda)$. For any fixed $\lambda$, we can compute $V(\lambda)$ in polynomial time, since it is defined by an LP with $n$ variables and $d$ equality constraints. Fix $\gamma = \epsilon\delta^2$. The algorithm NETAPPROX does the following (see Appendix J for pseudocode): if $X^T(\tilde{X}\lambda - y) = 0$ has a solution, then output 0. Otherwise, let $N$ be the net guaranteed by Lemma E.1. Then compute $V(\lambda)$ for every $\lambda \in N$, and output the estimate $\hat{S} = n - \max_{\lambda \in N} V(\lambda)$. Since the algorithm involves solving $(2\sqrt{d}/\gamma)^d$ linear programs, the time complexity is $(2\sqrt{d}/\gamma)^d \cdot \mathrm{poly}(n)$, as claimed. Next, since the algorithm maximizes $V(\lambda)$ over a subset of $\mathbb{R}^{d-1}$, it is clear that $\hat{S} \ge \mathrm{Stability}(X,y)$. It remains to prove the upper bound on $\hat{S}$.

Recall from Lemma B.1 that for any $\lambda \in \mathbb{R}^{d-1}$,
$$V(\lambda) = \inf_{u \in \mathbb{R}^d}\ \sup_{w \in [0,1]^n} \left\{ \|w\|_1 \ :\ \sum_{i=1}^n w_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0 \right\}.$$
Let $\lambda^*$ be a maximizer of $V(\lambda)$, and for notational convenience let $k = V(\lambda^*)$. If $\tilde{X}\lambda^* = y$, then the algorithm correctly outputs 0. Otherwise, by the guarantee of Lemma E.1, there are some $\lambda \in N$ with $\tilde{X}\lambda \ne y$ and $\sigma \in \{-1,1\}$ such that
$$\left\| \frac{\tilde{X}\lambda^* - y}{\|\tilde{X}\lambda^* - y\|_2} - \sigma \frac{\tilde{X}\lambda - y}{\|\tilde{X}\lambda - y\|_2} \right\|_2 \ \le\ \gamma.$$
Pick any $u \in \mathbb{R}^d$. Since $V(\lambda^*) = k$, there is some $w^* = w^*(\sigma u) \in [0,1]^n$ such that $\|w^*\|_1 \ge k$ and $\sum_{i=1}^n w^*_i (\langle \tilde{X}_i, \lambda^*\rangle - y_i)\langle X_i, \sigma u\rangle \ge 0$. Without loss of generality, there is at most one coordinate $i \in [n]$ such that $w^*_i$ is strictly between 0 and 1. Also, the above inequality implies that
$$\sum_{i=1}^n w^*_i \frac{\langle \tilde{X}_i, \lambda\rangle - y_i}{\|\tilde{X}\lambda - y\|_2} \langle X_i, u\rangle \ \ge\ \sum_{i=1}^n w^*_i \left( \frac{\langle \tilde{X}_i, \lambda\rangle - y_i}{\|\tilde{X}\lambda - y\|_2} - \sigma \frac{\langle \tilde{X}_i, \lambda^*\rangle - y_i}{\|\tilde{X}\lambda^* - y\|_2} \right) \langle X_i, u\rangle \ \ge\ -\left\| \frac{\tilde{X}\lambda - y}{\|\tilde{X}\lambda - y\|_2} - \sigma \frac{\tilde{X}\lambda^* - y}{\|\tilde{X}\lambda^* - y\|_2} \right\|_2 \|Xu\|_2 \ \ge\ -\gamma \|Xu\|_2,$$
where the last two inequalities are by Cauchy-Schwarz and the guarantee of Lemma E.1, respectively. Now define $w \in [0,1]^n$ by the following procedure. Initially set $w := w^*$. Iterate through $[n]$ in increasing order of $(\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle$ and repeatedly set the current coordinate $w_i := 0$, until $\sum_{i=1}^n w_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle \ge 0$. Obviously, this procedure will terminate with a feasible $w$, and throughout the procedure the sum $\sum_{i=1}^n w_i (\langle \tilde{X}_i, \lambda\rangle - y_i)\langle X_i, u\rangle$ is non-decreasing. If the procedure terminates after making $t$ updates, then $\|w\|_1 \ge \|w^*\|_1 - t$, so it remains to bound $t$. If $|\langle \tilde{X}_i, \lambda\rangle - y_i| \ge (\delta/\sqrt{n})\|\tilde{X}\lambda - y\|_2$ and $|\langle X_i, u\rangle| \ge (\delta/\sqrt{n})\|Xu\|_2$ and $w^*_i = 1$, then setting $w_i := 0$ increases the normalized sum by at least $(\delta^2/n)\|Xu\|_2$; since the initial deficit is at most $\gamma\|Xu\|_2$ and $\gamma = \epsilon\delta^2$, the number of such steps is at most $\epsilon n$. The number of steps with $|\langle \tilde{X}_i, \lambda\rangle - y_i| < (\delta/\sqrt{n})\|\tilde{X}\lambda - y\|_2$ is at most $\epsilon n$ by $(\epsilon, \delta)$-strong anti-concentration. Similarly, the number of steps with $|\langle X_i, u\rangle| < (\delta/\sqrt{n})\|Xu\|_2$ is at most $\epsilon n$. And the number of steps with $0 < w^*_i < 1$ is at most 1. Thus, $\|w\|_1 \ge \|w^*\|_1 - 3\epsilon n - 1$. We conclude that $V(\lambda) \ge V(\lambda^*) - 3\epsilon n - 1$, so $\hat{S} \le \mathrm{Stability}(X,y) + 3\epsilon n + 1$.
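The inner maximization $V(\lambda)$ used by NETAPPROX is a plain LP once $\lambda$ is fixed, and can be sketched with an off-the-shelf solver. This is a minimal illustration with scipy on a toy dataset, not the experimental code: maximize $\sum_i w_i$ over $w \in [0,1]^n$ subject to $X^T(w \star (\tilde{X}\lambda - y)) = 0$.

```python
import numpy as np
from scipy.optimize import linprog

def V_of_lambda(X, X_tilde, y, lam):
    """Max total weight ||w||_1 over w in [0,1]^n such that lam satisfies
    the weighted normal equations X^T (w * (X_tilde @ lam - y)) = 0.
    NETAPPROX takes the best such value over a net of lam's, and
    Stability(X, y) <= n - max_lam V(lam)."""
    n = len(y)
    r = X_tilde @ lam - y              # residuals at this candidate lam
    A_eq = X.T * r                     # d x n matrix; column i is r_i * X_i
    res = linprog(c=-np.ones(n),       # maximize sum(w) == minimize -sum(w)
                  A_eq=A_eq, b_eq=np.zeros(X.shape[1]),
                  bounds=[(0.0, 1.0)] * n)
    return -res.fun if res.success else 0.0

# Toy data: y = x exactly. Nullifying the slope leaves only an intercept
# lam free; at lam = 1, only the middle sample can carry weight.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])   # columns: x, intercept
X_tilde = X[:, 1:]                                   # covariates kept free
y = np.array([0.0, 1.0, 2.0])
print(V_of_lambda(X, X_tilde, y, np.array([1.0])))   # -> 1.0
```

The two equality constraints here force $w_1 = w_3 = 0$, so the LP value is 1: keeping more than one sample is incompatible with a zero slope at this $\lambda$.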

F MOTIVATION FOR ANTI-CONCENTRATION ASSUMPTIONS F.1 SMOOTHING IMPLIES ANTI-CONCENTRATION

In this section, we show that anti-concentration (Assumption A) holds under the mild assumption that the response variable is smoothed. In the following proposition, notice that we always have the crude bound $\|X\beta^{(0)} - r\|_2 \le \|r\|_2$.

Proposition F.1. Let $\sigma > 0$ and let $(X_i, r_i)_{i=1}^n$ be arbitrary, with $X_1, \dots, X_n \in \mathbb{R}^d$ and $r_1, \dots, r_n \in \mathbb{R}$. Suppose that $Z_1, \dots, Z_n \sim N(0, \sigma^2)$ are independent Gaussian random variables, and define $y_i = r_i + Z_i$ for $i \in [n]$. Then with probability at least $1 - n^{-d} - 2e^{-n}$, the dataset $(X_i, y_i)_{i=1}^n$ satisfies $(2d/n, \delta)$-anti-concentration, where
$$\delta = \min\left( \frac{\sigma}{n\sqrt{d}\,\|X\beta^{(0)} - r\|_2},\ \frac{1}{3n^3\sqrt{2d}} \right)$$
and $\beta^{(0)} \in \mathrm{OLS}(X, r, \mathbf{1})$.

Proof. Without loss of generality, $\beta^{(0)} = (X^TX)^\dagger X^T r$. Define $\hat{\beta} = (X^TX)^\dagger X^T y$; we want to upper bound $\|X\hat{\beta} - y\|_2$. Let $Q = X(X^TX)^\dagger X^T$; then $y - X\hat{\beta} \sim N((I-Q)r,\ \sigma^2(I-Q))$. But $(I-Q)r = r - X\beta^{(0)}$, and $N(0, \sigma^2(I-Q))$ is stochastically dominated by $N(0, \sigma^2 I)$. So
$$\|y - X\hat{\beta}\|_2 \ \le\ \|r - X\beta^{(0)}\|_2 + 2\sigma\sqrt{n}$$
with probability at least $1 - 2e^{-n}$, by the tail bound for $\chi^2$ random variables.

Next, we claim that with high probability, for every set $S \subseteq [n]$ of size $|S| = 2d$, there is no $\beta \in \mathbb{R}^d$ such that $|\langle X_i, \beta\rangle - y_i| \le \sigma/(n^3\sqrt{2d})$ for all $i \in S$. Fix some $S$. We bound the probability that there is some $\beta \in \mathbb{R}^d$ with
$$\sum_{i \in S} (\langle X_i, \beta\rangle - y_i)^2 \ \le\ \frac{\sigma^2}{n^6},$$
because this event contains the event that all residuals in $S$ are at most $\sigma/(n^3\sqrt{2d})$. It suffices to restrict to the OLS estimator $\hat{\beta}_{\mathrm{OLS}} = (X_S^TX_S)^\dagger X_S^T y_S$. Defining $P = X_S(X_S^TX_S)^\dagger X_S^T$, we have
$$y_S - X_S\hat{\beta}_{\mathrm{OLS}} = (r_S + Z_S) - P(r_S + Z_S) = (I-P)(r_S + Z_S) \sim N((I-P)r_S,\ \sigma^2(I-P)).$$
Note that $I-P$ is an orthogonal projection onto a space of dimension $d' := |S| - \mathrm{rank}(X_S) \ge d$, so there is a matrix $M \in \mathbb{R}^{2d \times d'}$ such that $MM^T = I-P$ and $M^TM = I_{d'}$. Then we can write
$$y_S - X_S\hat{\beta}_{\mathrm{OLS}} = \sigma MA + (I-P)r_S = M(\sigma A + M^T r_S)$$
for a random vector $A \sim N(0, I_{d'})$. This means that $\|y_S - X_S\hat{\beta}_{\mathrm{OLS}}\|_2^2 = \|\sigma A + M^T r_S\|_2^2$. But for any $\mu \in \mathbb{R}$, note that $|\xi + \mu|$ stochastically dominates $|\xi|$ where $\xi \sim N(0,1)$, so $(\xi+\mu)^2$ stochastically dominates $\xi^2$, and thus $\|\sigma A + M^T r_S\|_2^2$ stochastically dominates $\|\sigma A\|_2^2$. Therefore,
$$\Pr\left[ \|y_S - X_S\hat{\beta}_{\mathrm{OLS}}\|_2^2 \le \frac{\sigma^2}{n^6} \right] \ \le\ \Pr\left[ A_1^2 + \dots + A_{d'}^2 \le \frac{1}{n^6} \right].$$
For any $1 \le i \le d'$, since the density of $A_i$ is bounded above by $1/\sqrt{2\pi}$, we have
$$\Pr[A_1^2 + \dots + A_{d'}^2 \le 1/n^6] \ \le\ \prod_{i=1}^{d'} \Pr[|A_i| \le 1/n^3] \ \le\ \left( \frac{\sqrt{2}}{n^3\sqrt{\pi}} \right)^{d'} \ \le\ n^{-3d'}.$$
Recalling that $d' \ge d$, this shows that for a fixed $S \subseteq [n]$ of size $2d$, with probability at least $1 - n^{-3d}$, there is no $\beta \in \mathbb{R}^d$ with $|\langle X_i, \beta\rangle - y_i| \le \sigma/(n^3\sqrt{2d})$ for all $i \in S$. A union bound over sets $S$ of size $2d$ proves the claim.

Finally, we use the above bounds to show that the smoothed data satisfies $(2d/n, \delta)$-anti-concentration. We consider two cases. First, if $\|X\beta^{(0)} - r\|_2 \le \sigma\sqrt{n}$, then with probability $1 - 2e^{-n}$ we have $\|X\hat{\beta} - y\|_2 \le 3\sigma\sqrt{n}$, so for any $\beta \in \mathbb{R}^d$, the number of $i \in [n]$ satisfying
$$|\langle X_i, \beta\rangle - y_i| \ \le\ \frac{\delta}{\sqrt{n}}\|X\hat{\beta} - y\|_2 \ \le\ \frac{1}{3n^3\sqrt{2nd}}\|X\hat{\beta} - y\|_2 \ \le\ \frac{\sigma}{n^3\sqrt{2d}}$$
is at most $\epsilon n$.
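Proposition F.1 can be illustrated numerically: smooth a perfectly collinear response with Gaussian noise and count how many residuals fall well below the typical scale $\|X\hat{\beta} - y\|_2/\sqrt{n}$. This is a hedged simulation with a fixed seed; the constants (0.05, the data shape) are illustrative and not those of the proposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.1
X = np.column_stack([np.arange(n) / n, np.ones(n)])
r = X[:, 0]                          # degenerate response: exactly linear in X
y = r + rng.normal(0.0, sigma, n)    # smoothed response

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = np.abs(X @ beta - y)
M = np.linalg.norm(X @ beta - y)     # total residual norm after smoothing
# After smoothing, very few residuals are much smaller than M/sqrt(n):
# count those below a small fraction of the typical residual scale.
tiny = np.sum(resid < 0.05 * M / np.sqrt(n))
print(tiny, "of", n, "residuals are tiny")
```

Without the noise, every residual would be exactly zero and anti-concentration would fail completely; with it, only a handful of residuals land in the tiny window, in line with the proposition.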

F.2 DISTRIBUTIONAL ASSUMPTIONS FOR STRONG ANTI-CONCENTRATION

In this section, we show that under reasonable distributional assumptions on the samples (X i , y i ) n i=1 , strong anti-concentration (Assumption B) holds with constant ϵ, δ > 0. First, it holds if the samples (X i , y i ) are i.i.d. and have an arbitrary multivariate Gaussian joint distribution. Proposition F.2. Let n, d, ϵ > 0 and set δ = ϵ/4. Let Σ : d × d be symmetric and positive-definite. Let Z 1 , . . . , Z n ∼ N (0, Σ) be independent and identically distributed. If n ≥ Cd log(n)/ϵfoot_1 for some absolute constant C, then with probability at least 1 -exp(-Ω(ϵ 2 n)), samples (Z i ) n i=1 satisfy (ϵ, δ)-strong anti-concentration, i.e. for all β ∈ R d , it holds that i ∈ [n] : |⟨Z i , β⟩| < δ √ n ∥Zβ∥ 2 ≤ ϵn where Z : n × d is the matrix with rows Z 1 , . . . , Z n . Proof. Let Σ = 1 n n i=1 Z i Z T i . Then by concentration of Wishart matrices, it holds with probability at least 1 -2 exp(-(n -d)/2) that Σ ⪯ 2Σ (see e.g. Exercise 4.7.3 in Vershynin (2018) ). In this event, ∥Zβ∥ 2 2 /n ≤ 2E|⟨Z 1 , β⟩| 2 for all β ∈ R d . Let F = {f β : β ∈ R d } be the class of binary functions f β (x) = 1[|⟨x, β⟩| 2 < (ϵ 2 /4)E|⟨Z 1 , β⟩| 2 ]. Observe that every function in F is the intersection of parallel half-spaces, so F has VC dimension O(d). Moreover, for any β, Ef β (Z 1 ) = Pr[|⟨Z 1 , β⟩| 2 < (ϵ 2 /4)β T Σβ] ≤ ϵ 2 since ⟨Z 1 , β⟩ ∼ N (0, β T Σβ), and Gaussian random variables are anti-concentrated (Pr(|ξ| < c) ≤ c for any c > 0 and ξ ∼ N (0, 1)). Thus, by the Vapnik-Chervonenkis bound and assumption on n, Pr sup f ∈F 1 n n i=1 f β (Z i ) > Ef β (Z 1 ) + ϵ/2 ≤ exp(O(d log n) -Ω(ϵ 2 n)) ≤ exp(-Ω(ϵ 2 n)).

So with probability at least 1 − exp(−Ω(ϵ²n)), it holds that for all β ∈ R^d,

|{i ∈ [n] : |⟨Z_i, β⟩|² < (ϵ²/4)βᵀΣβ}| ≤ ϵn.

This means that with probability at least 1 − exp(−Ω(ϵ²n)) − exp(−(n − d)/2), we have that for all β ∈ R^d,

|{i ∈ [n] : |⟨Z_i, β⟩|² < (ϵ²/8)‖Zβ‖₂²/n}| ≤ ϵn,

as desired.

Then it is easy to see that the proof of Theorem 1.5 immediately extends to give the following result:

Theorem G.3. For any ϵ, δ > 0, there is a (√d/(ϵδ²))^d · poly(n)-time algorithm which, given ϵ, δ, and samples (X_i, y_i, Z_i)^n_{i=1} satisfying (ϵ, δ)-strong anti-concentration, returns an estimate Ŝ satisfying IV-Stability(X, y, Z) ≤ Ŝ ≤ IV-Stability(X, y, Z) + 3ϵn + 1.

H HEURISTIC FOR LOWER BOUNDING STABILITY

In this section we explain the "LP lower bound" which we applied in Section 5 to obtain unconditional lower bounds on the stability of various datasets. Given a list of thresholds L and a subset size m, we randomly pick a set S ⊆ [n] of size m, and enumerate the regions R₁, …, R_p defined by the hyperplanes ⟨X̃_i, λ⟩ − y_i = t for all i ∈ S and all t ∈ L. Fix one such region R. As in PARTITIONANDAPPROX, we use the change of variables g_i = (1 − w_i)(⟨X̃_i, λ⟩ − y_i), and enforce the constraint X̃ᵀg = X̃ᵀ(X̃λ − y). For some of the i ∈ [n], the residual ⟨X̃_i, λ⟩ − y_i has constant sign on the entire region. For these samples, we enforce the constraint 0 ≤ g_i ≤ ⟨X̃_i, λ⟩ − y_i (if the residual is non-negative) or ⟨X̃_i, λ⟩ − y_i ≤ g_i ≤ 0 (if the residual is non-positive). However, because we did not include a hyperplane for every sample, it is likely that for some i ∈ [n] the residual attains both signs within the region, in which case the constraint w_i ∈ [0, 1] is not convex in g and λ. We therefore relax the constraint to inf_{λ∈R}(⟨X̃_i, λ⟩ − y_i) ≤ g_i ≤ sup_{λ∈R}(⟨X̃_i, λ⟩ − y_i). Note that this is indeed a relaxation, because the interval contains 0. Finally, let K₊ ⊆ [n] be the set of indices for which the residual is non-negative on R, and let K₋ ⊆ [n] be the set of indices for which the residual is non-positive. We then minimize the objective

Σ_{i∈K₊} g_i / sup_{λ∈R}(⟨X̃_i, λ⟩ − y_i) + Σ_{i∈K₋} g_i / inf_{λ∈R}(⟨X̃_i, λ⟩ − y_i).

Because this objective is less than or equal to Σ_{i∈[n]} g_i/(⟨X̃_i, λ⟩ − y_i), and because we only relaxed constraints, this program has value at most V_R (the value of the exact non-linear program restricted to λ ∈ R). Compared to the provable approximation algorithm described previously, this algorithm is a heuristic because each region will typically contain samples whose residuals change sign within the region.

As rough intuition, if the residual ⟨X̃_i, λ⟩ − y_i remains fairly small throughout the region, then relaxing the constraint on w_i may not be problematic. However, if the residual can attain large magnitude in the region, then relaxing the constraint on w_i may significantly change the value of the program. Still, we may expect that if the partition is sufficiently fine, then a region where some residual is allowed to blow up will contain many samples whose residuals are forced to be large. This motivates an additional heuristic to refine the certification algorithm: if (w*, β*) is the optimal solution to the sensitivity problem, and β⁽⁰⁾ = OLS(X, y, 1) is the original regressor, then it must hold that

Σ_{i=1}^n w*_i(⟨X_i, β*⟩ − y_i)² ≤ Σ_{i=1}^n w*_i(⟨X_i, β⁽⁰⁾⟩ − y_i)² ≤ Σ_{i=1}^n (⟨X_i, β⁽⁰⁾⟩ − y_i)².

On the other hand, if R is the region containing the optimal solution and ‖w*‖₁ ≥ n − k, then

Σ_{i=1}^n w*_i(⟨X_i, β*⟩ − y_i)² ≥ inf_{S⊆[n]:|S|=n−k} Σ_{i∈S} (⟨X_i, β*⟩ − y_i)² ≥ inf_{S⊆[n]:|S|=n−k} Σ_{i∈S} inf_{λ∈R} (⟨X̃_i, λ⟩ − y_i)².

Thus, if we compute Q_i(R) = inf_{λ∈R}(⟨X̃_i, λ⟩ − y_i)² for each i ∈ [n] (by computing the interval of achievable residuals), then, conditioned on R containing the optimal solution, we can lower bound k by sorting Q₁(R), …, Q_n(R) and finding the smallest subset K ⊆ [n] such that Σ_{i∉K} Q_i(R) ≤ ‖Xβ⁽⁰⁾ − y‖₂²; then |K| is a valid lower bound on k. To summarize the algorithm: for each region we compute the maximum of the LP lower bound and the residual lower bound, and the output of the algorithm is the minimum over all regions. See Algorithm 4 in Appendix J for detailed pseudocode.

Boston Housing dataset. For the two-dimensional datasets, we applied the net upper bound with 100 trials and the LP lower bound with L = {0} and m = 100. For the three-dimensional dataset, we applied the net upper bound with 1000 trials and the LP lower bound with L = {0} and m = 30.
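The sorting step of the residual lower bound can be sketched as follows; `residual_lower_bound` is our own illustrative helper name, taking precomputed values Q_i(R) and the loss budget ‖Xβ⁽⁰⁾ − y‖₂²:

```python
import numpy as np

def residual_lower_bound(Q, budget):
    """Given Q[i] = inf over lambda in R of the i-th squared residual, and
    budget = ||X beta^(0) - y||_2^2, return the smallest |K| such that
    removing the |K| largest Q_i brings the remaining sum under budget.
    Conditioned on R containing the optimum, |K| lower-bounds k."""
    Q = np.sort(np.asarray(Q, dtype=float))[::-1]  # largest first
    remaining = Q.sum()
    k = 0
    while remaining > budget and k < len(Q):
        remaining -= Q[k]
        k += 1
    return k

# Toy examples with hand-picked squared-residual lower bounds.
print(residual_lower_bound([4.0, 1.0, 0.25], budget=1.5))  # → 1
print(residual_lower_bound([1.0] * 10, budget=3.0))        # → 7
```

The greedy choice (remove the largest Q_i first) is optimal here because any other subset of the same size leaves a larger remaining sum.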

I.3 INDEPENDENT TRIALS & ERROR BARS

For randomized algorithms with two-sided error, standard practice is to run the algorithm multiple times on the same dataset and report, e.g., the median with error bars. However, our algorithms provide unconditional upper/lower bounds on the stability, so this is not necessary: if we did run the LP lower bound multiple times, we could simply report the maximum outcome, and this would remain a valid lower bound on the stability; similarly, for the net upper bound we could report the minimum outcome. However, for our synthetic two-dimensional datasets (Figures 1 and 2a), because the data itself is random, we construct 10 independent copies of each dataset. For each algorithm and each dataset we compute the stability bound, and we report the median bound and error bars across the 10 independent-but-identically-constructed datasets. All error bars are 25th and 75th percentiles. There are no error bars in our synthetic three-dimensional experiment (Figure 2b) because, due to computational constraints, we conducted only one trial per noise level (rather than 10).

I.4 COMPUTATIONAL DETAILS

All experiments were done in Python on a Microsoft Surface Laptop, using GUROBI (Gurobi Optimization, LLC, 2022) with an Academic License to solve the linear programs. Each plot took at most 30 hours to generate (specifically, the three-dimensional isotropic Gaussian data experiment took 3 hours for each of the 10 datasets, dominated by the time required for the LP lower bound algorithm; the other plots were faster).

I.5 OMITTED EXPERIMENT: COVARIANCE SHIFT

Consider a dataset with n samples (X_i, y_i) drawn from X_i ∼ N(0, Σ) and y_i = −X_{i1} + X_{i2}, where Σ = [1, −1; −1, 2]. Additionally, there are k + 1 outliers of two types: the k type-I outliers (X_i, y_i) have X_i = c(1, −3) and y_i = −C (large and negative); the single type-II outlier (X_i, y_i) has X_i = √n(1, 1) and y_i chosen to lie exactly on the OLS best-fit hyperplane y = ⟨x, β̂⟩. Initially β̂₁ > 0, and clearly removing the last k + 1 samples suffices to flip the sign. However, the initial sample covariance is roughly Σ̂ = Σ + 11ᵀ = [2, 0; 0, 3]. The influence of a sample (X_i, y_i) on coordinate j of the OLS regressor β̂ is ⟨(Σ̂⁻¹)_j, X_i⟩ · (y_i − ⟨X_i, β̂⟩). As a result, the type-I outliers have negative influence on β̂₁, so the greedy algorithm initially does not remove them. The influence becomes positive only after removing the type-II outlier, because this shifts the sample covariance back to Σ, and therefore flips the sign of ⟨(Σ̂⁻¹)₁, X_i⟩. Constructing this example experimentally requires some care in the choices of k, c, and C: we need the total influence of the type-I outliers (proportional to kcC) to be large enough that β̂₁ > 0, and we need kc² to be small enough that the type-I outliers barely affect the sample covariance. Moreover, the number of trials needed by the net algorithm scales roughly with C. We take n = 1000, k = 30, c = 0.2, and C = 300. Applying the greedy baseline and the net upper bound (with 1000 trials), we find that the former removes 97 samples while the latter removes a (fractional) weight of roughly 16.7 samples.

In this example, the failure of the greedy baseline can be attributed to the constant-factor shift in the sample covariance achieved by removing the type-II outlier.
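A sketch of this construction in numpy (illustrative, not the exact experiment code; the influence formula is the one stated above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, c, C = 1000, 30, 0.2, 300.0

# Bulk: X ~ N(0, Sigma) with y = -x_1 + x_2 (noiseless).
Sigma = np.array([[1.0, -1.0], [-1.0, 2.0]])
X_bulk = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
y_bulk = -X_bulk[:, 0] + X_bulk[:, 1]

# k type-I outliers and one type-II outlier, as in the construction above.
X_I = np.tile([c, -3.0 * c], (k, 1))
y_I = np.full(k, -C)
x_II = np.sqrt(n) * np.array([1.0, 1.0])

# Place the type-II response exactly on the best-fit hyperplane of the rest,
# so that adding it leaves the OLS solution unchanged.
X_rest = np.vstack([X_bulk, X_I])
y_rest = np.concatenate([y_bulk, y_I])
beta_rest, *_ = np.linalg.lstsq(X_rest, y_rest, rcond=None)
y_II = float(x_II @ beta_rest)

X = np.vstack([X_rest, x_II])
y = np.concatenate([y_rest, [y_II]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Influence of sample i on beta_1: <(S^-1)_1, X_i> * (y_i - <X_i, beta>),
# where S is the sample covariance (here roughly [[2, 0], [0, 3]]).
S_inv = np.linalg.inv(X.T @ X / len(y))
infl = (X @ S_inv[0]) * (y - X @ beta)

print(beta[0] > 0)                # initially the first coefficient is positive
print(np.all(infl[n:n + k] < 0))  # type-I outliers have negative influence
```

The two printed booleans reproduce the qualitative claim: β̂₁ > 0 while the type-I outliers look "unhelpful" to the greedy heuristic until the type-II outlier is removed.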

J FORMAL PSEUDOCODE FOR ALGORITHMS

Algorithm 1 NETAPPROX (fragment)

1: procedure NETAPPROX((X_i, y_i)^n_{i=1}, ϵ, δ)
2:   X̃ ← [(Xᵀ)_2; …; (Xᵀ)_d]
3:   if X̃ᵀ(X̃λ − y) = 0 has a solution λ then M ← ‖Xβ⁽⁰⁾ − y‖₂
4:   m ← Cϵ⁻¹(d log ϵ⁻¹ + log η⁻¹) ▷ C > 0 is a universal constant

For each region R ∈ R, solve the program (with λ₀(R) a fixed point of R and σ_i the sign of the i-th residual on R):

  sup over g ∈ Rⁿ, λ ∈ R, s ∈ Rⁿ of Σ_{i∈[n]\(B_M ∪ B_{δM})} s_i, subject to
    X̃ᵀg = 0,
    0 ≤ g_i ≤ ⟨X̃_i, λ⟩ − y_i  for all i ∈ [n] with σ_i = 1,
    ⟨X̃_i, λ⟩ − y_i ≤ g_i ≤ 0  for all i ∈ [n] with σ_i = −1,
    s_i ≤ 1  for all i ∈ [n] \ (B_M ∪ B_{δM}),
    s_i ≤ g_i/(⟨X̃_i, λ₀(R)⟩ − y_i)  for all i ∈ [n] \ (B_M ∪ B_{δM}).

For the LP lower bound, for each region R ∈ R:
11:  For each i ∈ [n], compute l_i = inf_{λ∈R}(⟨X̃_i, λ⟩ − y_i) and r_i = sup_{λ∈R}(⟨X̃_i, λ⟩ − y_i).
12:  Set K₊ = {i ∈ [n] : l_i ≥ 0} and K₋ = {i ∈ [n] : r_i ≤ 0} \ K₊.
13:  Solve the linear program

     Ŝ(R) ← inf over g ∈ Rⁿ, λ ∈ R of Σ_{i∈K₊} g_i/r_i + Σ_{i∈K₋} g_i/l_i, subject to
       X̃ᵀg = X̃ᵀ(X̃λ − y),
       0 ≤ g_i ≤ ⟨X̃_i, λ⟩ − y_i  for all i ∈ K₊,
       ⟨X̃_i, λ⟩ − y_i ≤ g_i ≤ 0  for all i ∈ K₋,
       l_i ≤ g_i ≤ r_i  for all i ∈ [n] \ (K₊ ∪ K₋).

14:  For each i ∈ K₊ ∪ K₋ let Q_i = min(l_i², r_i²); for each i ∈ [n] \ (K₊ ∪ K₋) let Q_i = 0.



(This is in the real-number model; a similar statement can be made in the bit-complexity model.)

Second, strong anti-concentration holds if the samples are drawn from a mixture of centered Gaussian distributions, with arbitrary weights, so long as each covariance matrix has bounded largest and smallest eigenvalues.



Figure 1: Heterogeneous data

Figure 3: Results from Boston Housing dataset. Figure (a) plots the net upper bounds on the y-axis against the greedy upper bounds on the x-axis; Figure (b) plots the LP lower bounds on the y-axis against the net upper bounds on the x-axis. In both (a) and (b), each mark corresponds to one of the 156 feature pairs. Figure (c) plots the feature zn against the feature crim (on log scale); each mark is one of the 506 datapoints.

[h(x) ≠ y] ≤ ϵ. Specifically, for any λ ∈ R^{d−1}, we define a function f_λ : [n] → {0, 1} as the indicator function of the samples for which the residual at λ is close to the residual at λ*. Then the region containing λ* is precisely the set of functions which perfectly fit S × {1}, and the generalization bound implies that all such functions fit most of [n] × {1}, which is what we wanted to show. The following lemma formalizes this argument, with some additional steps to deal with very small and very large residuals.

Lemma D.3. Let λ* ∈ R^{d−1} and let η > 0. Let R be the region containing λ*. Let B_M ⊆ [n] be the set of i ∈ [n] such that |⟨X̃_i, λ*⟩ − y_i| > M, and let B_{δM} ⊆ [n] be the set of

Figure 4: Coefficients of OLS regression for the suburb/city split (Figure (a)) and a random split with the same ratio (Figure (b)). In both cases, we first rescale all covariates to mean 0 and variance 1, and add a constant variable (the last coefficient in both plots). In Figure (a), we then plot the OLS regressor on all samples (in blue), the OLS regressor on all 134 zn > 0 samples (in orange), and the OLS regressor on all 472 zn = 0 samples (in green). In Figure (b), we pick a subset S of samples of expected size 134. We plot the OLS regressor on all samples (in blue), the OLS regressor on S (in orange), and the OLS regressor on S^c (in green).

A ← [(Xᵀ)_2; …; (Xᵀ)_d; −y], d′ ← rank(A)
7: U, D, V ← SVD(A) (so that A = UDVᵀ and D is the diagonal matrix of the d′ nonzero singular values)
Define the γ-net for S^{d′−1} by M ← {R(±m₁, ±m₂, …, ±m_{d′}) : m₁, …, m_{d′} ∈ {0, γ/√d′, 2γ/√d′, …, 1}}, where R : d′ × d′ is a uniformly random rotation matrix
10: Define the γ-net for "residual space" (Lemma E.1)
Add regions R ∩ {⟨v_i, x⟩ ≤ c_i} and R ∩ {⟨v_i, x⟩ ≥ c_i} to R

procedure PARTITIONANDAPPROX((X_i, y_i)^n_{i=1}, δ, ϵ, η)
2: β⁽⁰⁾ ← arg min_{β∈R^d} ‖Xβ − y‖₂² ▷ X : n × d matrix with rows X₁, …, X_n
3:

B_M ← {i ∈ [n] : |⟨X̃_i, λ₀(R)⟩ − y_i| > M}
18: B_{δM} ← {i ∈ [n] : |⟨X̃_i, λ₀(R)⟩ − y_i| < δM/…}
if inf_{λ∈R} ⟨X̃_i, λ⟩ − y_i ≥ 0 then

Let Q_{(1)} ≤ ⋯ ≤ Q_{(n)} be the sorted numbers Q₁, …, Q_n and compute N(R) ← sup k

Lemma D.1. Each region R_i is the intersection of O(|E|) linear equalities or inequalities. Moreover, the regions R₁, …, R_p can be enumerated in time O(|E|)^{d+O(1)}.

Proof. Order the set of equations E arbitrarily. We recursively construct the set of regions demarcated by the first t equations. For each such region, we solve a linear program to check whether the (t+1)-th hyperplane intersects the interior of the region. If so, we split the region according to the sign of the (t+1)-th hyperplane. The overall time complexity of this procedure is O(p · poly(n, |E|)), where p is the final number of regions. But the number of regions which can be cut out by t hyperplanes in R^d is at most O(t)^d by standard arguments.

Next, we argue that even though we only multiplicatively partitioned a small subset of the residuals, with high probability most residuals are well-approximated. More precisely, we show that for the (random) region R containing a fixed point λ*, with high probability, for every other λ in the region, most residuals at λ are multiplicatively close to the corresponding residuals at λ*. This can be proven by the generalization bound for function classes with low VC dimension:

Theorem D.2 (Theorem 3.4 in Kearns & Vazirani (1994)). Let X be a set. Let C ⊆ {0, 1}^X be a binary concept class with VC dimension d. Let D be a distribution on X × {0, 1}. Let ϵ, δ > 0 and pick m ∈ N satisfying
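The recursive splitting in the proof of Lemma D.1 can be sketched as follows. This is an illustrative stand-in using scipy's LP solver for the interior-intersection checks (the helper names, tolerance, and bounding box are our own choices; the paper's experiments use GUROBI):

```python
import numpy as np
from scipy.optimize import linprog

def feasible(A_ub, b_ub, dim):
    # LP feasibility check (zero objective) over a large bounding box.
    res = linprog(np.zeros(dim), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-1e6, 1e6)] * dim, method="highs")
    return res.status == 0

def enumerate_regions(hyperplanes, dim, tol=1e-7):
    """Regions of R^dim cut out by hyperplanes (a, c), meaning <a, x> = c.
    A region is a list of (a, c, sign): sign=+1 means <a, x> <= c, else >=."""
    regions = [[]]
    for a, c in hyperplanes:
        a = np.asarray(a, dtype=float)
        updated = []
        for region in regions:
            A = [np.asarray(ai) * s for ai, ci, s in region]
            b = [ci * s for ai, ci, s in region]
            below = feasible(np.array(A + [a]), np.array(b + [c - tol]), dim)
            above = feasible(np.array(A + [-a]), np.array(b + [-(c + tol)]), dim)
            if below and above:   # hyperplane cuts the interior: split
                updated.append(region + [(a, c, +1)])
                updated.append(region + [(a, c, -1)])
            else:                 # region lies on one side: keep as-is
                updated.append(region)
        regions = updated
    return regions

# Three lines in general position in the plane cut out 1 + 3 + C(3,2) = 7 regions.
lines = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
print(len(enumerate_regions(lines, dim=2)))  # → 7
```

Each of the p final regions triggers at most one LP per hyperplane, matching the O(p · poly(n, |E|)) bound in the proof.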


Sample i₁, …, i_m ∈ [n] without replacement

X̃ ← [(Xᵀ)_2; …; (Xᵀ)_d]


Proof. Let A : n × d be the matrix with columns (Xᵀ)_2, …, (Xᵀ)_d, −y. Let d′ = rank(A) and let A = UDVᵀ be the singular value decomposition of A, where D = diag(s₁, …, s_{d′}) is the diagonal matrix of nonzero singular values of A. Let M be a "marginally-random" γ-net over the unit sphere S^{d′−1} under the ℓ₂ metric, where by "marginally-random" we mean that M is chosen from some distribution under which every point of the net is marginally uniform over S^{d′−1} (e.g., take any fixed γ-net and apply a uniformly random rotation). Also define B = VD⁻¹. Suppose B_d is not identically zero. Then define … Since each m ∈ M is a generic unit vector (by the marginally-random property) and B_d is nonzero, all (Bm)_d are nonzero with probability 1, so the above set is well-defined. Moreover, for any m ∈ M, … as desired. On the other hand, if B_d is identically zero, then V_d = 0, so y = 0. This boundary case can be avoided by picking any nonzero covariate (Xᵀ)_i among (Xᵀ)_2, …, (Xᵀ)_d and replacing y by y + c(Xᵀ)_i for a generic c ∈ R; this does not change Stability(X, y). Thus, N can be constructed in time O(…

All that is left is to show that under strong anti-concentration, the value of the linear program is Lipschitz in λ (under the metric described in the previous lemma). This proves Theorem 1.5.

Theorem E.2. Let ϵ, δ > 0. There is an algorithm NETAPPROX with time complexity …

Let Σ₁, …, Σ_k be symmetric and positive-definite matrices. Suppose that there are constants λ, Λ > 0 such that λI ⪯ Σ₁, …, Σ_k ⪯ ΛI, and define κ = Λ/λ. For arbitrary weights w₁, …, w_k ≥ 0 with w…

Proof. The proof is similar to that of the previous proposition. First, note that each Z_i can be coupled with ξ ∼ N(0, ΛI) so that …, and for any β, … By the Vapnik-Chervonenkis bound, with probability at least 1 − exp(−Ω(ϵ²n)), we get that for all … Combining with the upper bound on ‖Zβ‖₂², we have with probability at least …
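The "marginally-random" net construction (a fixed grid-based γ-net composed with a Haar-random rotation) can be sketched in numpy. The dimension, γ, and the QR-based rotation are illustrative choices of ours, not the exact construction used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 3, 0.5

# Fixed net: grid points with per-coordinate spacing gamma/sqrt(d),
# projected onto the unit sphere (covering radius at most gamma).
step = gamma / np.sqrt(d)
grid = np.arange(0.0, 1.0 + 1e-9, step)
axis = np.concatenate([-grid[:0:-1], grid])
pts = np.stack(np.meshgrid(*([axis] * d)), axis=-1).reshape(-1, d)
pts = pts[np.linalg.norm(pts, axis=1) > 1e-9]
net = pts / np.linalg.norm(pts, axis=1, keepdims=True)

# A Haar-random rotation (QR of a Gaussian matrix) makes every net point
# marginally uniform on the sphere while rigidly preserving the covering.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
net = net @ Q.T

# Empirical covering check: random unit vectors lie within gamma of the net.
U = rng.standard_normal((200, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
dists = np.min(np.linalg.norm(U[:, None, :] - net[None, :, :], axis=2), axis=1)
print(dists.max() <= gamma)  # → True
```

The covering property survives projection to the sphere because for a unit vector u and its nearest grid point p, ‖u − p/‖p‖‖ ≤ 2‖u − p‖.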

G EXTENSION TO IV LINEAR REGRESSION

We extend the definition of stability to measure the stability of the sign of a coefficient of the IV linear regressor:

Definition G.1. For samples (X_i, y_i, Z_i)^n_{i=1} with covariates X_i ∈ R^d, response y_i ∈ R, and instruments Z_i ∈ R^p, the ordinary IV estimator set with weight vector w ∈ … The finite-sample stability of (X_i, y_i, Z_i)^n_{i=1} is then defined as IV-Stability(X, y, Z) := inf …

For any k, the expression IV-Stability(X, y, Z) ≥ k is still defined by a bilinear system of equations in β and w. Our exact algorithm for Stability(X, y) never uses that the weighted residual w ⋆ (Xβ − y) is multiplied by Xᵀ in the OLS solution set, rather than by some arbitrary matrix Zᵀ; all that matters is that this matrix has boundedly many rows. Thus, with a bound on the number of instruments, the algorithm generalizes to computing IV-Stability(X, y, Z):

Theorem G.2. There is an n^{O(dp(d+p))}-time algorithm which, given n arbitrary samples (X_i, y_i, Z_i)^n_{i=1} with X₁, …, X_n ∈ R^d, Z₁, …, Z_n ∈ R^p, and y₁, …, y_n ∈ R, and given k ≥ 0, decides whether IV-Stability(X, y, Z) ≤ k.

The algorithm NETAPPROX also generalizes to IV regression. To state the guarantee, we define an anti-concentration assumption for IV regression data.

Assumption C. Let ϵ, δ > 0. We say that samples …
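For concreteness, the weighted IV estimator underlying Definition G.1 can be sketched in numpy in the just-identified case p = d; `iv_estimator` is our own illustrative helper, not the paper's code:

```python
import numpy as np

def iv_estimator(X, y, Z, w=None):
    """Weighted just-identified IV estimator (p = d): the beta solving
    Z^T diag(w) (X beta - y) = 0. With w = all-ones this is the ordinary
    IV estimator; with Z = X it reduces to OLS."""
    if w is None:
        w = np.ones(len(y))
    return np.linalg.solve(Z.T @ (w[:, None] * X), Z.T @ (w * y))

# Sanity check: with Z = X and unit weights, this matches OLS.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(50)
beta_iv = iv_estimator(X, y, X)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_iv, beta_ols))  # → True
```

Dropping samples corresponds to zeroing entries of w, exactly as in the OLS stability definition; only the matrix multiplying the weighted residual changes from Xᵀ to Zᵀ.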

I FURTHER EXPERIMENTAL DETAILS

I.1 IMPLEMENTED ALGORITHMS

Net upper bound. We implement NETAPPROX with one modification: instead of deterministically picking the net M by discretization (see Lemma E.1), we let M be a set of random unit vectors from S^{d′−1}, and then compute N as in Lemma E.1. Instead of parametrizing the algorithm by the desired approximation error ϵ, we parametrize it by |M|. Despite this change, the algorithm still provides a provable, exact upper bound on Stability(X, y).

LP lower bound. The disadvantage of NETAPPROX is that it only lower bounds the stability under an assumption that seems hard to check. The PARTITIONANDAPPROX algorithm is better, because it unconditionally, with high probability, outputs either an accurate estimate or a failure symbol ⊥. However, the Ω(n^d) time complexity (needed so that in each region all n residuals have constant sign) may be prohibitively slow in practice. For this reason, we introduce a heuristic simplification of PARTITIONANDAPPROX which provably lower bounds the stability with no assumptions.

At a high level, we decrease the number of regions by dropping the requirement that within each region all residuals have constant sign. The algorithm is parametrized by a list of thresholds L and a subset size m, and the regions are demarcated by the hyperplanes ⟨X̃_i, λ⟩ − y_i = t for m random choices of i ∈ [n] and all t ∈ L. Now, for samples which do not have constant residual sign in a particular region R, the constraints w_i ∈ [0, 1] are nonconvex after the change of variables. We relax these constraints to linear constraints, and relax the objective function to skip the "bad" samples. Heuristically, this relaxation should not lose too much on samples whose residuals remain small throughout R, but it may be problematic if the residual blows up. This motivates a complementary lower bound heuristic based on the minimum squared loss achievable by any λ ∈ R. See Appendix H for details.

Baseline greedy upper bound (Broderick et al., 2020; Kuschnig et al., 2021). We implement the greedy algorithm described by Kuschnig et al. (2021), which refines the algorithm of Broderick et al. (2020): iteratively remove the sample with the largest local influence until the sign of the first coefficient of the OLS regressor is reversed, recomputing the influences after each step.

Baseline lower bound. This algorithm simply computes the squared-loss-based lower bound (used in our full lower bound algorithm) for each region. See Appendix H for details.
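The greedy baseline admits a compact sketch. This is a minimal illustration of the iterate-and-recompute loop described above, not the authors' implementation; `greedy_removals` and the toy data are our own:

```python
import numpy as np

def greedy_removals(X, y):
    """Greedy heuristic (Broderick et al., 2020; Kuschnig et al., 2021):
    repeatedly drop the sample whose local influence most props up the sign
    of the first OLS coefficient, recomputing influences after each removal,
    until the sign flips. Returns the number of removals (n if no flip)."""
    idx = np.arange(len(y))
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    sign0 = np.sign(beta0[0])
    for removed in range(len(y)):
        Xs, ys = X[idx], y[idx]
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        if np.sign(beta[0]) != sign0:
            return removed
        # Influence of sample i on beta_1: <(S^-1)_1, X_i> * (y_i - <X_i, beta>).
        S_inv_row = np.linalg.inv(Xs.T @ Xs / len(idx))[0]
        infl = (Xs @ S_inv_row) * (ys - Xs @ beta)
        idx = np.delete(idx, np.argmax(sign0 * infl))
    return len(y)

# Toy data: a bulk with a slightly negative slope plus 5 planted outliers
# that make the overall slope positive; greedy removes exactly those 5.
rng = np.random.default_rng(0)
x = np.concatenate([rng.standard_normal(40), np.full(5, 5.0)])
y = np.concatenate([-0.2 * x[:40], np.full(5, 10.0)])
print(greedy_removals(x[:, None], y))  # → 5
```

On the bulk points the influence is negative while the slope is positive, so greedy targets the planted outliers first; this is exactly the local reasoning that the covariance-shift example in Appendix I.5 defeats.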

I.2 HYPERPARAMETER CHOICES

The net upper bound has only one hyperparameter (the number of trials), which should be chosen as large as possible subject to computational constraints. The LP lower bound has two hyperparameters (the size of the random subsample and the set of thresholds). We chose these ad hoc, subject to our computational constraints; experiments to determine the optimal tradeoff between the subsample size and the threshold set could be useful. However, because the LP lower bound unconditionally lower bounds the stability no matter what hyperparameters we choose, in practice it suffices to try several sets of hyperparameters and take the maximum of the resulting lower bounds.

Heterogeneous data experiment. For each dataset, we applied the net upper bound with 1000 trials, the LP lower bound with L = {−0.01, 0, 0.01} and m = 30, and the baseline lower bound with L = {−0.01, 0, 0.01} and m = 1000.

Isotropic Gaussian data. For each dataset with noise level σ, we applied the net upper bound with 10^d trials, the LP lower bound with L = {−σ, 0, σ} and m = 30, and the baseline lower bound with L = {−σ, 0, σ} and m = 1000.

