Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Abstract

Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and more typically in 'untidy' or raw data, practitioners face significant issues of data quality and diversity that can be prohibitively labor-intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology, the task of uncovering and inferring the metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training, and enabling scalable human auditing of relevant examples.

1. Introduction

Modern machine learning is characterized by ever-larger datasets and models. This expanding scale has produced impressive progress (Wei et al., 2022; Kaplan et al., 2020; Roberts et al., 2020), yet it presents both optimization and auditing challenges. Real-world dataset collection techniques often result in significant label noise (Vasudevan et al., 2022), and can produce significant numbers of redundant, corrupted, or duplicate inputs (Carlini et al., 2022). Scaling the size of our datasets makes detailed human analysis and auditing labor-intensive, and often simply infeasible. These realities motivate a consideration of how to efficiently characterize different aspects of the data distribution. Prior work has developed a rough taxonomy of data properties, or metadata, which different examples might exhibit, including but not limited to: noisy (Wu et al., 2020; Yi and Wu, 2019; Thulasidasan et al., 2019a;b), atypical (Hooker et al., 2020; Buolamwini and Gebru, 2018; Hashimoto et al., 2018; Słowik and Bottou, 2021), challenging (Ahia et al., 2021; Baldock et al., 2021; Paul et al., 2021; Agarwal et al., 2021), prototypical or suited to core subset selection (Paul et al., 2021; Sener and Savarese, 2018; Shim et al., 2021; Huggins et al., 2017; Sorscher et al., 2022), and out-of-distribution (Hendrycks et al., 2019; LeBrun et al., 2022).

While important progress has been made on some of these metadata categories individually, they are typically addressed in isolation, reflecting an overly strong assumption that only one, known issue is at play in a given dataset. For example, considerable work has focused on the issue of label noise. A simple yet widely used approach to mitigating label noise is to remove the impacted data examples (Pleiss et al., 2020). However, it has been shown that it is challenging to distinguish difficult examples from noisy ones, which often leads to useful data being thrown away when both noisy and atypical examples are present (Wang et al., 2018; Talukdar et al., 2021). Meanwhile, loss-based prioritization techniques (Jiang et al., 2019; Katharopoulos and Fleuret, 2018) essentially do the opposite: they upweight high-loss examples, assuming these examples are challenging yet learnable. Such methods have been shown to quickly degrade in performance in the presence of even small amounts of noise, since upweighting noisy samples hurts generalization (Hu et al., 2021; Paul et al., 2021). The underlying issue with both kinds of approach is the assumption of a single, known type of data issue.

Interventions are often structured to identify examples as simple vs. challenging, clean vs. noisy, typical vs. atypical, in-distribution vs. out-of-distribution, etc. However, large-scale datasets may contain subsets with many different properties. In these settings, understanding the interactions between an intervention and many different subsets of interest can help prevent points of failure. Moreover, relaxing the assumption that all these properties are independent allows us to capture realistic scenarios where multiple metadata annotations apply to the same datapoint; for example, an example may be challenging precisely because it is atypical.

In this work, we are interested in moving away from a siloed treatment of different data properties. We use the term Metadata Archaeology to describe the problem of inferring metadata across a more complete data taxonomy. Our approach, which we term Metadata Archaeology via Probe Dynamics (MAP-D), leverages distinct differences in training dynamics between curated subsets to enable specialized treatment and effective labelling of different metadata categories. Our method of constructing these probes is general enough that the same probe category can be crafted efficiently for many different datasets with limited domain-specific knowledge. We present consistent results across six image classification datasets, CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), Waterbirds (Sagawa et al., 2020), CelebA (Liu et al., 2015), and Clothing1M (Xiao et al., 2015), and two models from the ResNet family (He et al., 2016). Our simple approach is competitive with far more complex mitigation techniques designed to treat only one type of metadata in isolation. We summarize our contributions as:

• We propose Metadata Archaeology, a unifying and general framework for uncovering latent metadata categories.

• We introduce and validate the approach of Metadata Archaeology via Probe Dynamics (MAP-D): leveraging the training dynamics of curated data subsets called probe suites to infer other examples' metadata.

Figure 1: Examples surfaced through the use of MAP-D on the ImageNet train set. Column titles are the ground-truth classes; row titles are the metadata categories assigned by MAP-D. MAP-D performs metadata archaeology by curating a probe suite and then probing for similar examples based on training dynamics. This approach can bring to light biases, mislabelled examples, and other dataset issues.
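The core idea of inferring metadata from training dynamics can be illustrated with a minimal sketch. The snippet below assigns each unlabelled example the majority metadata category of its k nearest probe examples, measured by distance between per-epoch loss trajectories. This is an illustrative assumption about how probe dynamics could be compared, not the paper's exact procedure; all names (`map_d_assign`, the synthetic loss curves) are hypothetical.

```python
import numpy as np
from collections import Counter

def map_d_assign(probe_trajs, probe_labels, example_trajs, k=5):
    """Label each example with the majority metadata category of its
    k nearest probes, where distance is the Euclidean distance
    between per-epoch loss trajectories."""
    assignments = []
    for traj in example_trajs:
        dists = np.linalg.norm(probe_trajs - traj, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = Counter(probe_labels[i] for i in nearest)
        assignments.append(votes.most_common(1)[0][0])
    return assignments

# Synthetic loss curves: "typical" probes are learned quickly, while
# "noisy" probes (e.g. examples with shuffled labels) stay high-loss.
epochs = np.arange(10)
probe_trajs = np.stack(
    [np.exp(-0.8 * epochs) + 0.01 * i for i in range(5)]     # typical
    + [np.exp(-0.05 * epochs) + 0.01 * i for i in range(5)]  # noisy
)
probe_labels = ["typical"] * 5 + ["noisy"] * 5

# Two unlabelled examples: one learned quickly, one never learned.
queries = np.stack([np.exp(-0.7 * epochs), np.exp(-0.04 * epochs)])
print(map_d_assign(probe_trajs, probe_labels, queries, k=3))
# -> ['typical', 'noisy']
```

In practice the probe trajectories would be recorded during normal training of the model, so the metadata inference adds essentially no extra training cost.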

