DATA FEEDBACK LOOPS: MODEL-DRIVEN AMPLIFICATION OF DATASET BIASES

Abstract

Datasets scraped from the internet have been critical to large-scale machine learning. Yet this success puts the utility of future internet-derived datasets at risk, as model outputs begin to replace human annotations as a source of supervision. In this work, we formalize a system in which interactions with one model are recorded as history and later scraped as training data. We then analyze its stability over time by tracking changes to a test-time bias statistic (e.g., the gender bias of model predictions). We find that the degree of bias amplification is closely linked to whether the model's outputs behave like samples from the training distribution, a behavior we characterize and define as consistent calibration. Experiments in three conditional prediction scenarios (image classification, visual role-labeling, and language generation) demonstrate that models exhibiting sampling-like behavior are better calibrated and thus more stable. Based on this insight, we propose an intervention to help calibrate and stabilize unstable feedback systems.

1. INTRODUCTION

Due to the successes of large-scale training in machine learning (He et al., 2016; Brown et al., 2020; Radford et al., 2021), datasets derived from publicly available internet data have become indispensable to the machine learning community. For example, without relying on internet scraping, it would be cost-prohibitive to manually construct key datasets such as ImageNet (Deng et al., 2009), The Pile (Gao et al., 2020), or YFCC100M (Thomee et al., 2016). While the internet has served as a large, easily accessible source of human-generated data in the past, the growing deployment of machine learning systems puts this procedure at risk. As models begin to create and annotate a significant fraction of internet content, the utility of the internet as a data source may decrease rapidly.

As an example in visual role-labeling, consider a classifier trained on public photos and their associated tags, as depicted in Figure 1. Instead of manually tagging photos, some users may choose to auto-tag their photos with the model. These photos, now stored in internet history, may be scraped as training data for an updated iteration of the image-tagging model. Any systematic biases introduced by the model, such as consistently mislabeling female doctors as nurses as in Figure 1, are now encoded into the training data. This data feedback gradually degrades the quality of the internet as a data source, since supervision becomes driven by model outputs rather than human annotation. Issues stemming from the inclusion of model-generated content in training data have already been encountered in machine translation (Venugopal et al., 2011) and speech recognition (Radford et al., 2022). These concerns are especially important in situations where model predictions may exacerbate existing toxicity, harm, or other biases (Gehman et al., 2020; Zhao et al., 2017).
In such cases, a viable strategy for model developers is to weigh the benefit of updating their model with new internet content against the cost of amplifying biases via such model-induced feedback. However, it is not yet understood when, and to what degree, data feedback is an issue in practice. In this work, we define the data feedback setting and carefully study how model biases change under feedback. In particular, we ask: Are there conditions that stabilize bias amplification? We answer this in the affirmative, finding that one crucial path to achieving stability guarantees is having a consistently calibrated training procedure, one that produces models whose bias is similar to that of their training distribution. Furthermore, this form of calibration can be realistically achieved in natural experimental settings. Specifically, models that behave like samplers (i.e., replicate their training distribution well) are more likely to be calibrated and thus more stable. In addition, many prediction algorithms that do not explicitly perform sampling, such as image classifiers, fulfill this behavior through a conjectured phenomenon called Distributional Generalization (Nakkiran & Bansal, 2020).

Formally, we quantify the stability of data feedback with a bias metric ϕ(x, ŷ), where ŷ = f_t(x) are predictions from the model at time t. For example, the predictions ŷ are image tags or sentence completions, and the bias metrics ϕ are gender bias or sentence toxicity. Our theoretical result shows that if the model does not increase bias by more than an error δ, then the total bias amplification is bounded by ((m + k)/m) · δ, where m and k refer to the number of new human-annotated samples and model-annotated samples, respectively. Thus both a smaller calibration error δ and a higher fraction of human-annotated samples m contribute to the global stability of data feedback loops. The rest of the paper is organized as follows. In Section 3, we define the data feedback setting in more detail.
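The bound above can be checked numerically with a minimal sketch of the worst-case feedback recursion (our own illustration, not the paper's code): each round, the training pool mixes m human-annotated samples at the original bias b0 with k model-annotated samples whose bias exceeds the previous training bias by at most the calibration error δ.

```python
def feedback_bias(b0, m, k, delta, rounds):
    """Worst-case data feedback recursion for a scalar bias statistic.

    Each round: the model's bias is its training bias plus delta
    (consistent-calibration error), and the next training set mixes
    m human-annotated samples (bias b0) with k model-annotated ones.
    Returns the deployed model's bias after the given number of rounds.
    """
    b_train = b0
    for _ in range(rounds):
        b_model = b_train + delta                 # model bias this round
        b_train = (m * b0 + k * b_model) / (m + k)  # re-scraped training bias
    return b_train + delta                        # bias of the deployed model

# Illustrative (hypothetical) numbers: 80 human vs. 20 model samples per round.
b0, m, k, delta = 0.30, 80, 20, 0.02
b_limit = feedback_bias(b0, m, k, delta, rounds=100)
bound = b0 + (m + k) / m * delta                  # the ((m+k)/m) * delta bound
assert b_limit <= bound + 1e-9                    # amplification stays bounded
```

The recursion converges to the fixed point b0 + (k/m)·δ for the training bias, so the deployed model's bias approaches b0 + ((m + k)/m)·δ exactly, matching the stated bound: more human-annotated data (larger m) and a smaller δ both shrink the amplification.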
We then describe a specific notion of calibration (consistent calibration), discuss its connection to sampling, and show how it gives rise to bounds on bias amplification in Section 4. Section 5 demonstrates the utility of these predictions empirically in three natural experimental settings:

1. First, we define a simple data feedback setting on CIFAR (Krizhevsky, 2009), where the label distribution is skewed and feedback has the potential to amplify label shift. In this case, we show that the feedback dynamics are stable and consistent with our theoretical predictions.

2. Next, we show that data feedback can significantly amplify gender biases in a visual semantic role labeling task (Yatskar et al., 2016). Our bounds predict that the dynamics may be unstable since the initial calibration error is large, consistent with the gender bias amplification identified in earlier work (Zhao et al., 2017).

3. Third, we examine data feedback for language generation on a toxic prompts dataset (Gehman et al., 2020) and demonstrate that toxicity and repetition amplify, with sampling-based generation schemes enjoying substantially higher stability than beam search methods.

Finally, to conclude Section 5, we design an intervention to stabilize beam search methods by leveraging the sampling-like behavior of interpolating classifiers (Nakkiran & Bansal, 2020): we train a language model that overfits to its training set and observe that this procedure significantly stabilizes the model's toxicity and repetition.
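The sampling-versus-beam-search contrast above can be made concrete with a toy decoder (our own illustration, not one of the paper's experiments): when a model predicts a label as positive with probability 0.7, drawing labels from that distribution reproduces the training statistic, while always emitting the mode (an analogue of greedy or beam-search decoding) pushes the statistic to an extreme.

```python
import random

random.seed(0)
p_positive = 0.7  # training distribution: 70% of labels are positive

# Sampling decoder: draw each label from the predicted distribution.
sampled = [1 if random.random() < p_positive else 0 for _ in range(10_000)]

# Mode decoder (greedy/beam-search analogue): always emit the most likely label.
argmaxed = [1] * 10_000

print(sum(sampled) / len(sampled))    # close to 0.70: matches training bias
print(sum(argmaxed) / len(argmaxed))  # 1.00: the statistic is fully amplified
```

Under data feedback, the mode decoder's outputs would re-enter the training pool at a bias of 1.0 rather than 0.7, which is the mechanism behind the instability of beam search methods observed in the language generation experiments.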

2. RELATED WORK

Performative prediction. The general problem of model-induced feedback in machine learning has been previously studied as performative prediction and strategic classification (Perdomo et al., 2020; Hardt et al., 2016), where future data distributions can change arbitrarily in response to the deployed model. In this context, existing work has focused on methods that optimize towards equilibria of the system (Brown et al., 2022). The generality of the problem setting allows for complex human interactions in the loop; however, for this same reason, experimental evaluation has been limited, and most analyses have focused on convex settings with experiments on Gaussian data or simple synthetic data such as loan applications or credit risk (Izzo et al., 2021; Miller et al., 2021). In contrast, motivated by the image tagging example in Section 1, we consider a more restricted form of feedback, in which new data examples are gathered only from either the "true" human-annotated distribution or the predictions of the currently deployed model. This restriction allows us to analyze feedback stability in more realistic experimental settings and derive bounds on stability.

Bias amplification. Machine learning models have a tendency to amplify, at test time, biases that exist in their training data, a problem known as bias amplification (Dinan et al., 2019; Leino et al.,



Figure 1: A simple example of data feedback. An image-tagging model is trained on images from the internet. Some users auto-tag new images with the model and post them online, while others continue manually tagging their images. After some time, the model may be updated by re-scraping the internet and re-training on the updated data, which now includes feedback from previous model predictions.

