A THEORY OF DYNAMIC BENCHMARKS

Abstract

Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. While the static setting has been studied extensively, both theoretically and empirically, the dynamic counterpart lags behind: empirical studies remain limited and, to date, it lacks a theoretical foundation. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after as few as three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.

1. INTRODUCTION

In response to concerns around the limitations of static datasets as benchmarks, researchers have proposed dynamic benchmarking, a setting where data collection and model building happen iteratively in tandem, as an alternative (Nie et al., 2020; Potts et al., 2021; Kiela et al., 2021; Ma et al., 2021; Gehrmann et al., 2021). In dynamic benchmarking, model builders fit models against the current dataset, while annotators contribute new data points selected to challenge previously built models. The hope is that this iterative process yields a more diverse set of test cases that can help induce better model performance. Proponents argue that "dynamic adversarial data collection, where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets," while also recognizing that "the long-term benefits or drawbacks of adopting it as a core dataset creation paradigm remain poorly understood" (Wallace et al., 2022).

A major concern is that adversarial data collection proliferates idiosyncratic examples that succeed at fooling models but crowd out necessary yet easier test cases. This can, in turn, reduce dataset diversity and limit external validity (Bowman & Dahl, 2021).

A growing line of theoretical and empirical research on static benchmarks has improved our understanding of the strengths and limitations of that setting. In contrast, similar research on dynamic benchmarks has been limited; the high complexity and cost of studying live benchmarks impede experimental work. A stronger theoretical foundation for dynamic benchmarks could help guide empirical explorations of the vast design space, avoiding costly trial-and-error experiments. However
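To make the alternating protocol concrete, the following minimal sketch simulates a sequential dynamic benchmark in the spirit of the first model: in each round a model is fit to the current dataset, annotators then collect examples that fool it, and the new examples are added to the pool. All choices here (a one-dimensional threshold classifier, Gaussian data, and helper names such as fit_model and collect_adversarial) are illustrative assumptions, not the paper's actual construction or any benchmark's API.

    # Sketch of a sequential dynamic benchmark: model fitting and
    # adversarial data collection alternate round by round.
    import numpy as np

    rng = np.random.default_rng(0)

    def fit_model(xs, ys):
        """Fit a 1-D threshold classifier: predict 1 iff x >= threshold."""
        # Brute-force search over candidate thresholds for best training accuracy.
        candidates = np.concatenate(([xs.min() - 1.0], xs))
        accs = [np.mean((xs >= t) == ys) for t in candidates]
        return candidates[int(np.argmax(accs))]

    def accuracy(threshold, xs, ys):
        return float(np.mean((xs >= threshold) == ys))

    def collect_adversarial(threshold, n, true_threshold=0.0):
        """Annotators draw candidate points and keep those that fool the current model."""
        xs = rng.normal(size=10 * n)
        ys = (xs >= true_threshold)
        fooled = (xs >= threshold) != ys   # model is wrong on these points
        return xs[fooled][:n], ys[fooled][:n]

    # Held-out test set drawn from the underlying distribution.
    test_x = rng.normal(size=5000)
    test_y = (test_x >= 0.0)

    # Round 0: a static seed dataset.
    xs = rng.normal(size=200)
    ys = (xs >= 0.0)

    for rnd in range(5):
        model = fit_model(xs, ys)
        print(f"round {rnd}: threshold={model:+.3f}, "
              f"test acc={accuracy(model, test_x, test_y):.3f}")
        new_x, new_y = collect_adversarial(model, n=50)
        xs, ys = np.concatenate([xs, new_x]), np.concatenate([ys, new_y])

Running the sketch shows the qualitative behavior at issue: each round adds only examples concentrated where the current model errs, so later rounds contribute progressively narrower slices of the input space rather than broad coverage.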

