LIMITS OF ALGORITHMIC STABILITY FOR DISTRIBUTIONAL GENERALIZATION

Paper under double-blind review

Abstract

As machine learning models are increasingly considered in safety-critical settings, it is important to understand when models may fail after deployment. One cause of model failure is distribution shift, where the training and test data distributions differ. In this paper we investigate whether training models with algorithmically stable procedures improves model robustness, motivated by recent theoretical developments which show a connection between the two. We use techniques from differentially private stochastic gradient descent (DP-SGD) to control the level of algorithmic stability during training. We compare the performance of algorithmically stable training procedures to stochastic gradient descent (SGD) across a variety of possible distribution shifts, specifically covariate, label, and subpopulation shifts. We find that algorithmically stable training procedures yield models with consistently lower generalization gaps across various types of shifts and shift severities, as well as higher absolute test performance under label shift. Finally, we demonstrate that there is a tradeoff between distributional robustness, stability, and performance.

1. INTRODUCTION

As machine learning (ML) is applied in high-stakes decision-making settings such as healthcare (Ghassemi et al., 2017; Rajkomar et al., 2018; Zhang et al., 2021a) and lending (Liu et al., 2018; Weber et al., 2020), it is important to consider scenarios in which models fail. Typically, models are trained with empirical risk minimization (ERM), which assumes that the training and test data are sampled i.i.d. from the same underlying distribution (Vapnik, 1999). Unfortunately, this assumption means that ERM is susceptible to performance degradation under distribution shift (Nagarajan et al., 2021). Distribution shift occurs when the data distribution encountered during deployment differs from the training distribution, or changes over time while the model is in use. In practice, even subtle shifts can significantly affect model performance (Rabanser et al., 2019). Given that distribution shift is a significant source of model failure, much work has been directed toward improving model robustness to distribution shifts (Taori et al., 2020; Cohen et al., 2019; Engstrom et al., 2019; Geirhos et al., 2018; Zhang et al., 2019; Zhang, 2019). One concept recently introduced to improve model robustness is distributional generalization (Kulynych et al., 2022; Nakkiran & Bansal, 2020; Kulynych et al., 2020). Distributional generalization (DG) extends classical generalization to encompass any evaluation function (instead of just the loss objective) and allows the train and test distributions to differ. Kulynych et al. (2022) prove that algorithms satisfying total variation stability (TV stability) bound the gap between train and test metrics when distribution shift is present, i.e., algorithms that satisfy TV stability also satisfy DG. This motivates the use of techniques from differentially private (DP) learning to achieve DG, since DP implies TV stability (Kulynych et al., 2022).
We know from other work that DP learning often comes at a cost to accuracy (Bagdasaryan et al., 2019; Suriyakumar et al., 2021; Jayaraman & Evans, 2019). Unfortunately, prior work does not thoroughly explore the empirical implications of these theorems across a wide variety of settings, aside from a positive result in Suriyakumar et al. (2021). Because robustness to new settings is an important question for model deployment, it is important to understand how the theory of distributional robustness plays out in practice across different types and severities of shift. Furthermore, it is hard to determine from the current theory how practitioners should tune the level of stability to achieve high-performing models. In this paper we conduct an extensive empirical study of the impact of algorithmically stable learning strategies on robustness under distribution shift. Stable learning (SL) refers to approaches that constrain the model optimization objective or learning algorithm to improve model stability. We focus on two questions regarding the use of SL for DG in practice: (i) Under what types of shift is SL more robust and accurate than ERM? (ii) Are SL-trained models consistently robust across all hyperparameters, model architectures, and shift severities? We target four common examples of shift: covariate (Shimodaira, 2000), label (Lipton et al., 2018; Storkey, 2009), subpopulation (Duchi & Namkoong, 2021; Koh et al., 2021), and natural shifts (Taori et al., 2020). We use state-of-the-art models and large benchmark datasets focusing on realistic prediction tasks in object recognition, satellite imaging, biomedical imaging, and clinical notes (see Table 1, with details in Section 4.2). The primary comparison we make is through the generalization gap, defined as the difference in model performance between training and testing (Zhang et al., 2021b).
Under extensive experimentation, incorporating 32 distinct types of distribution shift and 5 severity levels, we find:

1. SL improves both accuracy and robustness for label and natural shifts.
2. SL has a robustness-accuracy tradeoff for covariate and subpopulation shift.
3. The tradeoffs of SL are consistent across different shift severities, model architectures, and hyperparameter settings.

2. RELATED WORK

Many approaches have been developed in pursuit of robustness to distribution shift, including domain adaptation (Wang & Deng, 2018), out-of-distribution detection (Yang et al., 2021), adversarial training (Madry et al., 2018; Ilyas et al., 2019), and algorithmic improvements (Sagawa et al., 2019). To address the distribution shift problem, many recent techniques for distributionally robust optimization (DRO), such as risk-averse learning (Curi et al., 2020), have been developed. However, many of these methods do not perform better than ERM (Pfohl et al., 2022) and involve complex implementations, making them difficult to use. Algorithmic stability has also been explored to improve distributional robustness. It is often easier to implement, with simpler methods such as ℓ2 regularization (Wibisono et al., 2009), early stopping (Hardt et al., 2016), and differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016). Early stopping and ℓ2 regularization have already been studied for their potential to improve distributional robustness (Sagawa et al., 2019). However, it is difficult to conduct fine-grained analyses of improved robustness with these methods because their stability is not directly controllable. This motivates our use of DP-SGD to investigate the limits of stability for DG, since we can control the level of stability by adjusting the noise multiplier σ in DP-SGD. While algorithmic stability has been explored theoretically in previous works (see Section 3), we explore it empirically in this paper across various synthetic and natural distribution shifts.

3. BACKGROUND AND NOTATION

We provide an overview of the connections between algorithmic stability, DP, and different forms of generalization. It is well established that algorithmic stability implies generalization in the traditional ERM setup (Bousquet & Elisseeff, 2002). Additional work has proven that DP implies stability and thus generalization (Bassily et al., 2016; Dwork et al., 2015). In this section we define these concepts and draw connections between them, clarifying the theoretical implication that DP leads to improved distributional robustness.

Notation. We assume there is a training dataset D_train = {(x_i, y_i)}_{i=1}^n of labeled examples such that D_train ∼ D, and a test dataset D_test = {(x_i, y_i)}_{i=1}^m of labeled examples such that D_test ∼ D′. Given D_train, we use a randomized learning algorithm M(D_train) to learn parameters θ ∈ Θ of a model relating the datapoints {x_i} to their corresponding labels {y_i}. We now describe differential privacy and its links to stability.

Definition 1 (Differential Privacy (Dwork et al., 2006)). Suppose we have two datasets D, D′ with Hamming distance 1 (i.e., the two datasets differ in exactly one example). An algorithm M is (ϵ, δ)-differentially private if for all S ⊆ Range(M):

Pr[M(D) ∈ S] ≤ exp(ϵ) Pr[M(D′) ∈ S] + δ.    (1)

By definition, DP guarantees privacy by bounding the effect that any individual datapoint has on the output of M. As a result, DP implies strong forms of stability such as TV stability (Bassily et al., 2016) and uniform stability (Wang et al., 2016). Next, we define these notions of algorithmic stability and show how DP implies them both.
Definition 2 (TV Stability). Suppose we have two datasets D, D′ with Hamming distance 1. An algorithm M is δ-TV stable if for all S ⊆ Range(M):

Pr[M(D) ∈ S] ≤ Pr[M(D′) ∈ S] + δ.    (2)

Equivalently, this is an upper bound on the total variation distance between the output distributions of M(D) and M(D′): with d_TV(P, Q) = sup_T |P(T) − Q(T)|, TV stability requires d_TV(M(D), M(D′)) ≤ δ.

Definition 3 (Uniform Stability (Bousquet & Elisseeff, 2002)). Suppose we have two datasets D, D′ with Hamming distance 1. An algorithm M is δ-uniformly stable if:

∀ D, D′ s.t. ∥D − D′∥ = 1: |E ℓ(D; M(D)) − E ℓ(D; M(D′))| ≤ δ.    (3)

If the loss function ℓ is bounded in [0, 1], then TV stability implies δ-uniform stability (Kulynych et al., 2022). All of these definitions assume we sample D and D′ from the same underlying distribution; when distribution shift occurs, this is no longer true. We therefore present the results of Kulynych et al. (2022), who demonstrate that TV-stable algorithms satisfy a notion of generalization that captures distribution shift, known as distributional generalization.

Definition 4 (Distributional Generalization (Kulynych et al., 2022; Nakkiran & Bansal, 2020; Kulynych et al., 2020)). Given two datasets D and D′ sampled from two different distributions P and Q, an algorithm M satisfies δ-distributional generalization (DG) if for all ϕ: D × Θ → [0, 1]:

|E_{D∼P} ϕ(D; M(D)) − E_{D∼P, D′∼Q} ϕ(D′; M(D))| ≤ δ.    (4)

Kulynych et al. (2022) prove that any algorithm which is δ-TV stable also satisfies δ-DG, implying that algorithmic stability improves robustness to distribution shift. The levels of TV stability and DG, both parameterized by δ, are directly controlled by the noise level σ used in DP-SGD: throughout the rest of the paper, the larger σ is, the more stable the algorithm (i.e., the lower δ is).
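As a concrete illustration of what δ-TV stability measures, the following NumPy sketch empirically estimates the total variation distance between the output distributions of a simple randomized mechanism run on two neighboring datasets. The mechanism `noisy_mean` and all names here are our own toy constructions, not part of any experiment in the paper.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    given as probability vectors over the same support."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def output_distribution(mechanism, dataset, n_bins=50, n_runs=5000, seed=0):
    """Histogram the outputs of a randomized mechanism M(D) over [0, 1]."""
    rng = np.random.default_rng(seed)
    outs = np.array([mechanism(dataset, rng) for _ in range(n_runs)])
    hist, _ = np.histogram(outs, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()

def noisy_mean(data, rng, sigma=0.1):
    """Toy Gaussian mechanism: a noisy, clipped mean (stand-in for M)."""
    return float(np.clip(np.mean(data) + rng.normal(0.0, sigma), 0.0, 1.0))

D = np.array([0.2, 0.4, 0.6, 0.8])
D_prime = np.array([0.2, 0.4, 0.6, 1.0])  # neighboring: Hamming distance 1

P = output_distribution(noisy_mean, D)
Q = output_distribution(noisy_mean, D_prime)
delta_hat = tv_distance(P, Q)  # empirical proxy for the stability parameter delta
```

Increasing the noise σ in `noisy_mean` drives `delta_hat` toward 0, mirroring how the noise multiplier in DP-SGD controls the stability level δ.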

4. METHODS

To better understand the potential and limits of using algorithmic stability to improve model robustness, we conduct a thorough empirical study across several datasets, types of shift, and shift severities. We explore both synthetic and natural distribution shifts that arise due to differences in the covariate, label, and subpopulation distributions between training and testing (Table 1). Our empirical investigation covers more than 200 experiments, incorporating 32 distinct forms of distribution shift with varying levels of severity. In each of our experiments, given a training dataset D_train and test dataset D_test with a known distribution shift, we compare the difference in generalization gap (Definition 5) between models trained with algorithmic stability ("stable learning", SL) and models trained with ERM. Characterizing this gap is an important step toward determining how well the theoretical guarantees of DG hold in practice. We aim to answer the following critical questions about the practical use of SL for DG, motivating its use beyond theory: (i) Under what types of shift is SL more robust and accurate than ERM? (ii) Are SL-trained models consistently robust across all hyperparameters, model architectures, and shift severities?

4.2. DATASETS

(i) Synthetic datasets: For synthetic covariate shift, we use CIFAR-10-C (Hendrycks & Dietterich, 2019). For synthetic label shift, we use Imbalanced-CIFAR10, which we created by inducing a class imbalance in CIFAR10 to create a shift in P(Y). These shifts were created randomly, where the percentage of samples retained from the original dataset was chosen randomly from 10-100%. To explore synthetic subpopulation shifts, we use the Waterbirds dataset (Sagawa et al., 2019), made up of bird images with synthetic backgrounds. (ii) Natural datasets: Our natural covariate shifts are derived from the Cells-Out-of-Sample (COOS) dataset, which consists of mouse cell images of 7 biological classes, with 4 separate test sets of increasing degrees of covariate shift (Lu et al., 2019).
We also explore natural subpopulation and label shifts with the PovertyMap dataset, predicting poverty levels from satellite imagery (Koh et al., 2021), and MIMIC-III clinical notes, predicting mortality (Johnson et al., 2016). For more detailed information about the datasets used in this paper, please refer to Appendix A.2.
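The Imbalanced-CIFAR10 construction described above (each class randomly retaining 10-100% of its samples) might be sketched as follows; the function name and details are illustrative assumptions on our part, not the paper's released code.

```python
import numpy as np

def make_label_shift(labels, low=0.10, high=1.00, seed=0):
    """Induce a random class imbalance: each class independently keeps a
    fraction of its examples drawn uniformly from [low, high]."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        frac = rng.uniform(low, high)
        n_keep = max(1, int(round(frac * len(idx))))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Example: a balanced toy dataset with 10 classes of 100 examples each.
labels = np.repeat(np.arange(10), 100)
subset = make_label_shift(labels)
class_counts = np.bincount(labels[subset], minlength=10)  # now imbalanced
```

Applying this to the CIFAR10 label array would shift P(Y) while leaving P(X|Y) untouched, which is the defining property of label shift.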

4.3. MODEL TRAINING

We train models using DP-SGD (with varying levels of noise and clipping as a way to modify the amount of stability, as detailed in Appendix A.1.2 and Appendix A.3) and ERM for each dataset and type of shift. Each model is trained with early stopping on the validation loss to prevent overfitting. The scale of the hyperparameter search and the noise levels used to determine the best-performing models can be found in Appendix A.3. For all experiments, we use the Opacus package (Yousefpour et al., 2021) to implement DP-SGD. For each experiment, we measure the generalization gap G, the difference between training and testing accuracy, for the SL and ERM models, G_SL and G_ERM. To measure the improvement of SL over ERM, we report the difference in generalization gap ∆_G, as defined in Definition 5 below.
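For intuition, a single DP-SGD update (the mechanism that packages such as Opacus implement) can be sketched in plain NumPy: clip each per-example gradient to an L2 norm bound, average, and add Gaussian noise scaled by the noise multiplier σ. This is a simplified illustration with invented names, not the paper's actual training code.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=0.1, rng=None):
    """One DP-SGD update: clip each per-example gradient to L2 norm
    clip_norm, average, add Gaussian noise with standard deviation
    noise_multiplier * clip_norm / batch_size, then take a step."""
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    g_bar = np.mean(clipped, axis=0)
    std = noise_multiplier * clip_norm / len(per_example_grads)
    noise = rng.normal(0.0, std, size=np.shape(w))
    return w - lr * (g_bar + noise)

w = np.zeros(3)
grads = [np.array([3.0, 0.0, 0.0]),   # norm 3 -> clipped to norm 1
         np.array([0.0, 0.5, 0.0])]   # norm 0.5 -> left unchanged
w_next = dp_sgd_step(w, grads, noise_multiplier=0.1)
```

Raising `noise_multiplier` increases stability (lower δ) but injects more noise into every update, which is the source of the accuracy costs discussed later.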

4.5. EVALUATION

We design our empirical investigation to answer the two questions posed at the outset of Section 4 on the limits of using SL for distributional robustness. We address them as follows:

(i) The primary metric we use to compare the robustness of models is the difference in generalization gap, ∆_G (Definition 5). If ∆_G > 0, the model trained with a stable learning algorithm has a lower generalization gap and is therefore more robust than the ERM-trained model. As mentioned in Section 4.4, we test across various levels of stability to find the optimal σ value.

∆_G = [E_{D∼P} ϕ(D; M_ERM(D)) − E_{D∼P, D′∼Q} ϕ(D′; M_ERM(D))] − [E_{D∼P} ϕ(D; M_A(D)) − E_{D∼P, D′∼Q} ϕ(D′; M_A(D))]

We calculate ∆_G for SL and DRO relative to ERM, referred to as SL ∆_G and DRO ∆_G respectively.

(ii) We investigate whether stability holds over different shift severities using the synthetic datasets by evaluating for which shifts ∆_G > 0. Shift severity is characterized as the distance between D_train and D_test, which we measure with the Optimal Transport Dataset Distance (OTDD) (Alvarez-Melis & Fusi, 2020). We choose this metric over other dataset distances for its provable guarantees and because it allows completely disjoint datasets to be compared. We categorize our synthetic datasets by shift severity by first normalizing the computed OTDD values and sorting them into quintiles 1-5.

(iii) To explore the tradeoff between robustness and accuracy, we also report model performance throughout the paper and compare it to ∆_G.

Table 3: SL demonstrates an accuracy-robustness tradeoff on the Waterbirds dataset representing synthetic subpopulation shifts. We observe that for all shifts, ∆_G > 0, indicating that each SL model is more robust than ERM. However, there is a consistent loss in accuracy. We present results across five shift severities for σ = 0.1.
Model performance is reported as accuracy, except for MIMIC-III and PovertyMap, where area under the curve (AUC) and the Pearson correlation coefficient are used, respectively, due to the differing prediction tasks (see Table 1). For MIMIC-III, we use AUC since it is the standard metric for evaluating clinical prediction models for mortality. PovertyMap involves predicting a real-valued composite asset wealth index from satellite images, so models are evaluated on the Pearson correlation (r) between their predicted and actual asset wealth indices.
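Definition 5 and the severity bucketing reduce to simple computations. The following sketch (our own illustrative code, with hypothetical numbers) shows both the ∆_G calculation and the quintile assignment from normalized distances.

```python
import numpy as np

def delta_g(erm_train, erm_test, sl_train, sl_test):
    """Difference in generalization gap (Definition 5): positive values
    mean the stable-learning model generalizes better than ERM."""
    return (erm_train - erm_test) - (sl_train - sl_test)

def severity_quintiles(distances):
    """Normalize dataset distances (e.g. computed OTDD values) to [0, 1]
    and bucket them into severity levels 1-5. Assumes the distances are
    not all identical."""
    d = np.asarray(distances, dtype=float)
    d = (d - d.min()) / (d.max() - d.min())
    return np.minimum(5, np.floor(d * 5).astype(int) + 1)

# Hypothetical accuracies: ERM gap 0.15, SL gap 0.05 -> delta_G = 0.10.
dg = delta_g(erm_train=0.95, erm_test=0.80, sl_train=0.90, sl_test=0.85)
levels = severity_quintiles([0.2, 1.1, 2.5, 3.0, 4.0])  # severity levels 1-5
```

Here ∆_G > 0 indicates the SL model closed two thirds of the ERM generalization gap in this made-up example.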

5. STABILITY HAS POOR ROBUSTNESS-ACCURACY TRADEOFFS FOR COVARIATE AND SUBPOPULATION SHIFTS

In this section, we examine results from our synthetic covariate and subpopulation experiments, shown in Table 2 and Table 3, and investigate potential sources of the robustness-accuracy tradeoff. We observe that for all shift severities in covariate CIFAR-C and subpopulation Waterbirds, SL has increased robustness compared to ERM, with ∆_G > 0. However, for both shifts there is a tradeoff between robustness and accuracy, seen in the consistently negative accuracy gains, which worsen with shift severity. We also find that SL is more robust on the naturally covariate-shifted COOS, but at the expense of accuracy (Table 4). This result holds across all values of stability we tried. This negative finding is not surprising because each of the shifts in COOS is a covariate shift.

Table 4: SL is more robust to most natural shifts found in COOS, except for the most severe, most likely due to lower model accuracy. These results are for σ = 0.1. Standard deviations are not provided because of computational constraints.

To gain insight into the nature of this robustness-accuracy tradeoff, we compare SL to a DRO algorithm that optimizes conditional value at risk (CVaR). Empirically, both methods fail on covariate shift (see Table 2), but, more interestingly, they fail for different reasons. Investigation into the underlying cause leads us to the following conjecture: we believe that SL mimics learning under a transformation of P(X) that moves it closer to the uniform distribution as the level of stability increases. With enough uniform stability, P(X) would become the uniform distribution, equivalent to eliminating all signal from the covariates and predicting randomly. Similarly, as SL approaches the uniform distribution, information about the changed P(X|G) under subpopulation shift is lost. In contrast, the DRO CVaR objective focuses on the tails of the distribution (Levy et al., 2020), providing limited utility for most covariate/subpopulation shifts, which oftentimes apply the same transform to every point in the distribution (Shimodaira, 2000; Duchi & Namkoong, 2018). These results and the corresponding conjecture lead to the conclusion that SL under uniform stability is not a good candidate for improving robustness under covariate or subpopulation shift, as it comes at a major cost to accuracy.

6. STABILITY IMPROVES ROBUSTNESS AND ACCURACY TO LABEL SHIFT AND NATURAL SHIFTS

In this section, we demonstrate that stability improves both robustness and accuracy under label shifts and natural shifts, and we investigate potential sources of this improvement over ERM. We draw on similarities between importance weighting, distributionally robust optimization, and stable learning to help explain these improvements. We first demonstrate these results on a variety of label shifts on Imbalanced-CIFAR (Table 5). Even as the shift severity increases, SL outperforms ERM, with ∆_G > 0 and a positive accuracy gain. However, these improvements occur at a specific level of stability, σ = 0.1 (i.e., the amount of noise in DP-SGD). At stronger levels of stability, robustness remains better than ERM but at the expense of accuracy (Table 11). This is the first of many results throughout our work indicating that the level of stability is a hyperparameter that should be tuned to find the best balance of robustness and accuracy. We demonstrate similar improvements when testing against natural distribution shifts: SL improves both robustness and accuracy on MIMIC-III and PovertyMap (Table 6). Specifically, we see increases in accuracy of 2.9% and 11.3% and increases in robustness of 4.2% and 0.8%, respectively. As with our label shift results, we had to tune the level of stability, again supporting the observation that it is a hyperparameter that must be tuned. We first investigate why stability outperforms ERM on both robustness and accuracy under label shift. We demonstrate that stable learning (especially methods satisfying uniform stability, like DP-SGD) mimics the training we would see if our training label distribution were uniform over all labels. We support this by obtaining similar results when training with DRO (to mimic a uniform label distribution) (Table 5); both of these methods also improve accuracy and robustness to a degree similar to SL.
Natural shifts are more difficult to accommodate than synthetic shifts because they are usually composed of multiple shifts, while most methods are developed to deal with a single type of shift. Given that natural shifts are much harder than synthetic ones, we investigate why SL is more accurate and robust. We identify that the datasets we considered contain a combination of shifts which make up the natural shift: both MIMIC-III and PovertyMap contain subpopulation and label shift. Thus, we believe the improvement in performance and robustness is in part due to SL being a much better learning algorithm for dealing with label shift.

7. STABILITY IS A HYPERPARAMETER TO BE TUNED

7.1. LOW LEVELS OF STABILITY BALANCE ROBUSTNESS AND ACCURACY

In this section, we investigate answers to question (ii) posed in Section 4. We explore what level of stability is needed and how consistent the results of the above two sections are across different model architectures and hyperparameters. Overall, low levels of stability are required to see improvements over ERM. From Table 8, Table 11, and Fig. 2 we observe that while increasing stability leads to a lower generalization gap, it decreases performance. Over all settings, we find that σ < 0.5 best balances the accuracy-robustness tradeoff; for larger values, because stability is enforced through the noise added in DP-SGD, accuracy becomes much worse. In practice, the amount of stability needed to balance this tradeoff is model- and dataset-dependent. As such, stability can be treated as a hyperparameter to be tuned for robustness (similar to regularization), rather than a guaranteed solution.
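The tuning procedure suggested above can be sketched as a simple selection rule: among candidate noise multipliers, keep those whose ∆_G improvement over ERM is non-negative and pick the most accurate. The function and sweep values below are illustrative assumptions, not results from the paper.

```python
def select_sigma(results, min_delta_g=0.0):
    """Among candidate noise multipliers, keep those whose improvement in
    generalization gap over ERM (delta_G) is at least min_delta_g, then
    return the one with the highest test accuracy (None if none qualify).
    `results` maps sigma -> (test_accuracy, delta_G)."""
    admissible = {s: acc for s, (acc, dg) in results.items()
                  if dg >= min_delta_g}
    if not admissible:
        return None
    return max(admissible, key=admissible.get)

# Illustrative sweep: more noise -> more robust (higher delta_G) but less accurate.
sweep = {0.0: (0.92, -0.01), 0.1: (0.90, 0.04),
         0.5: (0.81, 0.07), 1.0: (0.55, 0.09)}
best = select_sigma(sweep)
```

With these made-up numbers, σ = 0.1 is selected: it retains nearly all of the ERM accuracy while still improving the generalization gap, consistent with the σ < 0.5 regime reported above.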

7.2. BENEFITS OF STABLE LEARNING ARE CONSISTENT ACROSS ARCHITECTURES AND HYPERPARAMETERS

In our experiments we consistently observed lower generalization gaps across all hyperparameter settings of SL models. This indicates that the robustness improvements provided by SL hold across different model settings, and are not simply a result of well-chosen hyperparameters. In Fig. 1 , we examine three representative covariate, label, and subpopulation shifts. We find that the SL models more closely follow the ideal generalization trendline (in black) where the train set performance is equal to the test set performance. Additionally, we find that our results hold across a variety of commonly used model architectures. Specifically, we use a variety of CNNs and logistic regression models across our tasks and find that the findings do not change based on the model architecture. This is expected since algorithmic stability is agnostic to architecture and hyperparameter choices by definition.

8. CONCLUSION

Our study investigates the utility of stability as a tool for improving both robustness and accuracy under different distribution shifts. We find that, by design, stability improves robustness at the expense of accuracy for both covariate shift and subpopulation shift. Meanwhile, also by design, stability improves both robustness and accuracy for label and natural shifts. We determine that this is because of the equal importance that uniform stability places on every data point in the training set. Finally, we show that these results are consistent across hyperparameters, model architectures, and shift severities.



Definition 5 (Difference in Generalization Gap). Given datasets D and D′ sampled from two different distributions P and Q, an ERM-trained model M_ERM(D), an alternate training algorithm A with model M_A(D), and a metric ϕ: D × Θ → R, we define the difference in generalization gap as:

∆_G = [E_{D∼P} ϕ(D; M_ERM(D)) − E_{D∼P, D′∼Q} ϕ(D′; M_ERM(D))] − [E_{D∼P} ϕ(D; M_A(D)) − E_{D∼P, D′∼Q} ϕ(D′; M_A(D))]

Figure 1: Stability is consistent across hyperparameters. We plot the training vs. testing accuracy across 3 representative examples of a) covariate, b) label, and c) subpopulation shift from the CIFAR10-C, Imbalanced-CIFAR, and Waterbirds datasets, respectively. Each point in the graph represents a different hyperparameter experiment for the dataset. SL follows the y = x line more closely than ERM, indicating that the generalization gap of SL is lower than ERM and consistent across all hyperparameters.

(a) Accuracy and ∆G of covariate shift across stability levels (b) Accuracy and ∆G of label shift across stability levels

Figure 2: Robustness to distribution shift and accuracy are at odds for covariate shift and label shift when we use stable learning for models trained on CIFAR-C and Imbalanced-CIFAR respectively. This tradeoff worsens as the level of stability is increased.

Table 1: The datasets, prediction tasks, and model architectures used throughout the paper to evaluate the relationship between algorithmic stability and distribution shift. We investigate synthetic covariate, label, and subpopulation shifts, as well as naturally occurring instances of covariate and subpopulation shifts. We use the definitions of natural and synthetic shift from Taori et al. (2020): synthetic shifts are those where the data is originally all from the same distribution but is manipulated such that the training and test distributions differ, while natural shifts are those where the training and test distributions are already different.

We use the Opacus package or the TensorFlow Privacy package (McMahan & Andrew, 2018) to implement DP-SGD. Model and data references are given in Table 1. Note that the models used for the CIFAR datasets (Tramèr & Boneh, 2021) in the covariate and label shift experiments are small models created for differential privacy, to mitigate the dimensional dependence of DP-SGD (Yu et al., 2017). Thus, the reported ERM test accuracy is lower than state-of-the-art performance obtained with larger model architectures such as the ResNet (He et al., 2016) on the same datasets. When we compare against distributionally robust optimization (DRO) results, we use the conditional value at risk (CVaR) optimization algorithm (Lévy et al., 2020). We test each model in the sets {M_1, M_2, ...} on shifted test datasets {D′_1, D′_2, ...}. While covariate and subpopulation shift have shifted test datasets, label shift has a shifted training dataset, as practitioners would see in a scenario of class-imbalanced training data.

Table 2: SL demonstrates an accuracy-robustness tradeoff on the CIFAR-C dataset representing synthetic covariate shifts. We observe that for all shifts, SL ∆_G > 0, indicating that each SL model is more robust than ERM. However, there is a consistent loss in accuracy for SL. We present results across five shift severities for σ = 0.1. We observe a similar tradeoff for the DRO CVaR algorithm, indicating that these algorithms cannot improve both robustness and accuracy under covariate shift.

Table columns: Shift Severity | SL train acc | SL test acc | ERM train acc | ERM test acc | Accuracy Gain | ∆_G

Table 5: SL improves the robustness and accuracy of models under label shift in Imbalanced-CIFAR. We present results across five shift severities for σ = 0.1. We observe that for all shifts, ∆_G > 0, indicating that each SL model is more robust and accurate than ERM in the presence of label shift. We observe improvements in both robustness and accuracy for the DRO CVaR algorithm as well, indicating that there is no tradeoff between the two.

Table 6: SL improves the robustness and performance of models in the presence of natural subpopulation shifts. Here we use σ = 0.1 for MIMIC-III and σ = 0.001 for PovertyMap.

