A THEORY OF DYNAMIC BENCHMARKS

Abstract

Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. While the static setting has been studied extensively, both theoretically and empirically, the dynamic counterpart lags behind: empirical studies are limited and, to date, it has lacked a theoretical foundation. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising, for instance, from annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.

1. INTRODUCTION

In response to concerns about the limitations of static datasets as benchmarks, researchers have proposed dynamic benchmarking, a setting where data collection and model building happen iteratively in tandem (Nie et al., 2020; Potts et al., 2021; Kiela et al., 2021; Ma et al., 2021; Gehrmann et al., 2021). In dynamic benchmarking, model builders fit models against the current dataset, while annotators contribute new data points selected to challenge previously built models. The hope is that this iterative process yields a more diverse set of test cases that can help induce better model performance. Proponents argue that "dynamic adversarial data collection, where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets," yet they also recognize that "the long-term benefits or drawbacks of adopting it as a core dataset creation paradigm remain poorly understood" (Wallace et al., 2022). A major concern is that adversarial data collection proliferates idiosyncratic examples that fool models but crowd out necessary, easier test cases. This can, in turn, reduce dataset diversity and limit external validity (Bowman & Dahl, 2021).

A growing line of theoretical and empirical research on static benchmarks has improved our understanding of the strengths and limitations of that setting. In contrast, similar research on dynamic benchmarks has been limited: the high complexity and cost of studying live benchmarks impede experimental work. A stronger theoretical foundation for dynamic benchmarks could help guide empirical exploration of the vast design space, avoiding costly trial-and-error experiments. However, there has so far been no theory of dynamic benchmarking that could offer provable guarantees about performance and clarify how issues such as label noise interact with this setting.

1.1. OUR CONTRIBUTIONS

In this work, we initiate a theoretical study of dynamic benchmarks. We contribute a versatile formal model of dynamic benchmarks that serves as the basis for our investigation. We start with a fundamental question:

Question 1: Can we design dynamic benchmarks in such a way that models continue to improve as the number of rounds of data collection grows?

We begin with a theoretical model capturing existing implementations of dynamic benchmarking. This model proceeds in multiple rounds, interweaving data collection and model building sequentially. In round t, model builders face a distribution D_t and are tasked with finding a classifier h_t that performs well on D_t. We assume that model fitting succeeds in minimizing risk up to a positive classification error ϵ > 0. We further assume that annotators succeed in identifying the failure cases of the current model h_t, giving us access to the uniform distribution D̄_t over the error cases of h_t. We determine the new distribution D_{t+1} by mixing D_t and D̄_t in some proportion.

We assume a starting distribution D_0 on the instances of interest, which we can think of as the distribution corresponding to standard data collection. Mirroring the motivation for dynamic benchmarking, this distribution might assign little to no weight to important families of instances. In particular, an error set of measure ϵ, which we assume we can achieve from the get-go, might contain many relevant instances. The goal is therefore to converge to well below the ϵ-error level guaranteed by the above assumption. We assume the distribution admits a perfect classifier, so that the process could, in principle, converge to 0 error.

In this setting, we show that three rounds are guaranteed to converge to O(ϵ²) error. Unfortunately, this is where it ends. In general, there is no reason to expect this dynamic benchmark to progress below Ω(ϵ²) error. Put differently, there is no provable benefit to dynamic data collection beyond three rounds.
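The sequential model can be made concrete with a small simulation. The sketch below is our own illustrative construction, not the paper's: the function name, the mixing weight alpha, and the worst-case "err on the lowest-weight instances" heuristic are all assumptions. A discrete domain stands in for the instance space, a worst-case learner errs on an ϵ-mass subset of the current distribution D_t, and the update D_{t+1} = α·D_t + (1−α)·D̄_t mixes in the uniform distribution over the error set.

```python
import numpy as np

def run_sequential_benchmark(n=1000, eps=0.1, alpha=0.5, rounds=10):
    """Toy simulation of sequential dynamic benchmarking (illustrative only).

    The domain is n discrete instances with base distribution D_0
    (uniform here). Each round, a worst-case learner errs on an
    eps-mass subset of the current distribution D_t -- chosen here as
    the instances D_t weights least, i.e. the learner "forgets" what
    the benchmark currently emphasizes least. Annotators then supply
    the uniform distribution over the error set, and the next round's
    distribution mixes the two. We record the D_0-mass misclassified
    each round.
    """
    d0 = np.full(n, 1.0 / n)              # base distribution D_0
    dt = d0.copy()                        # current distribution D_t
    history = []
    for _ in range(rounds):
        # worst-case learner: place the eps error mass on the
        # instances the current distribution emphasizes least
        order = np.argsort(dt)
        cum = np.cumsum(dt[order])
        k = int(np.searchsorted(cum, eps)) + 1
        errors = np.zeros(n, dtype=bool)
        errors[order[:k]] = True
        history.append(float(d0[errors].sum()))
        # annotators: uniform distribution over the error set,
        # mixed into the current distribution
        d_err = errors / errors.sum()
        dt = alpha * dt + (1 - alpha) * d_err
    return history
```

For this adversarial learner, the D_0-error recorded in `history` never drops below ϵ in any round, no matter how many rounds are run, illustrating the stalling behavior described above.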
The cause of our negative result is a form of catastrophic forgetting that mirrors the concerns quoted earlier. As the benchmark moves beyond three rounds, there is provably no way to retain knowledge of instances correctly classified at earlier stages. Furthermore, we show through experiments that this lower bound may also be encountered in practice, preventing dynamic benchmarks from progressing beyond a small number of rounds. In doing so, we propose a concrete way to simulate the performance of a dynamic benchmark, which may be of independent interest in the empirical study of dynamic benchmarks.

There is yet another impediment to successful dynamic benchmarking. Above, we considered the case where the underlying learning problem is realizable, meaning that there exists a model achieving 0 error on the distribution. In practice, unrealizable settings with label noise are commonplace. Unrealizability can result from, for instance, annotator disagreement, whose impact on data diversity, label noise, and model performance an emerging line of work aims to understand. We show that in the unrealizable setting, dynamic benchmarks concentrate on mislabeled instances, losing their representativeness of the underlying distribution.

Though pessimistic, the above negative results may be inherent to the simple sequential design of dynamic benchmarks currently used in practice. To further probe this issue, we ask:

Question 2: Are there more sophisticated dynamic benchmark designs that can guarantee convergence below the error barrier of the standard setting?

We answer this question in the affirmative by considering a hierarchical model, which recursively uses the above sequential setting as a building block. In this setting, the organizer of a benchmark creates multiple instances of dynamic data collection and combines the outcomes in a particular way, e.g., by ensembling the resulting models and feeding the output into a new instance.
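Returning to the unrealizable setting: the concentration on mislabeled instances can be seen in a minimal toy simulation. This is our own construction, not the paper's; the function name, noise rate, and mixing weight are illustrative assumptions. Even a learner that classifies every cleanly labeled instance correctly is still judged wrong on the flipped labels, so they sit in every round's error set, and mixing drives the benchmark's mass onto exactly these instances.

```python
import numpy as np

def noisy_mass_trajectory(n=1000, noise=0.05, alpha=0.5, rounds=8):
    """Track the probability mass placed on mislabeled instances.

    A `noise` fraction of instances carry flipped labels. Even the
    best possible classifier is judged wrong on them, so they appear
    in every round's error set. Mixing the uniform distribution over
    the error set into D_t therefore concentrates the benchmark on
    the noise.
    """
    noisy = np.zeros(n, dtype=bool)
    noisy[: int(noise * n)] = True     # mislabeled instances
    dt = np.full(n, 1.0 / n)           # start from uniform D_0
    traj = []
    for _ in range(rounds):
        # best-case learner: the only remaining "errors" are the
        # mislabeled instances themselves
        d_err = noisy / noisy.sum()
        dt = alpha * dt + (1 - alpha) * d_err
        traj.append(float(dt[noisy].sum()))
    return traj
```

Under this recursion the mislabeled mass m_t satisfies m_{t+1} = α·m_t + (1−α), so it converges geometrically to 1: after a handful of rounds the benchmark consists almost entirely of mislabeled instances, losing representativeness of the underlying distribution.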
We study the setting where the hierarchy has depth two and show that it guarantees convergence to error O(ϵ³), providing a strict separation from the standard sequential model. Despite the improved guarantee, the depth-two design significantly complicates the benchmarking process, and executing designs of greater depth may be prohibitive in practice.
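A structural sketch of the depth-two design might look as follows. This is not the paper's exact construction: `run_instance`, the majority-vote ensembling, and the combination rule are our illustrative assumptions. The organizer runs several sequential instances, ensembles their final models, and hands the ensemble's residual errors to one more sequential instance.

```python
import numpy as np

def majority_vote(preds):
    """Combine binary predictions (n_models x n_points) by majority vote."""
    preds = np.asarray(preds)
    return (preds.sum(axis=0) * 2 >= preds.shape[0]).astype(int)

def depth_two_benchmark(run_instance, labels, n_branches=3):
    """Structural sketch of a depth-two hierarchical benchmark.

    `run_instance(labels, focus)` stands in for one full sequential
    dynamic benchmark: it returns the final model's predicted labels,
    optionally emphasizing the boolean mask `focus`. The organizer
    runs several branches, ensembles them, and feeds the ensemble's
    residual errors into one final sequential instance.
    """
    branch_preds = [run_instance(labels, focus=None) for _ in range(n_branches)]
    ensemble = majority_vote(branch_preds)
    residual = ensemble != labels          # errors surviving the ensemble
    final = run_instance(labels, focus=residual)
    # trust the final instance on the residual set, the ensemble elsewhere
    return np.where(residual, final, ensemble)
```

The design choice mirrors the text: each branch is an entire sequential benchmark, so a depth-two run multiplies the number of data-collection rounds by the number of branches, which is the "significant increase in complexity" noted above.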

