A THEORY OF DYNAMIC BENCHMARKS

Abstract

Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. While the static setting has received extensive theoretical and empirical study, its dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.

1. INTRODUCTION

In response to concerns around the limitations of static datasets as benchmarks, researchers have proposed dynamic benchmarking, a setting where data collection and model building happen iteratively in tandem, as an alternative (Nie et al., 2020; Potts et al., 2021; Kiela et al., 2021; Ma et al., 2021; Gehrmann et al., 2021). In dynamic benchmarking, model builders fit models against the current dataset, while annotators contribute new data points selected to challenge previously built models. In doing so, the hope is that the iterative process results in a more diverse set of test cases that can help induce better model performance.

Though proponents argue "dynamic adversarial data collection, where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets," there is also a recognition that "the long-term benefits or drawbacks of adopting it as a core dataset creation paradigm remain poorly understood" (Wallace et al., 2022). A major concern is that adversarial data collection proliferates idiosyncratic examples that do well in fooling models but eliminate coverage of necessary yet easier test cases. This can, in turn, reduce dataset diversity and limit external validity (Bowman & Dahl, 2021).

A growing line of theoretical and empirical research on static benchmarks has improved our understanding of the strengths and limitations of this setting. In contrast, similar research on dynamic benchmarks has been limited. The high complexity and cost of studying live benchmarks impede experimental work. A stronger theoretical foundation for dynamic benchmarks could help guide empirical explorations of the vast design space, avoiding costly trial-and-error experiments. However, to date there has been no theory of dynamic benchmarking that could offer provable guarantees about performance and clarify how issues such as label noise interact with this setting.

1.1. OUR CONTRIBUTIONS

In this work, we initiate a theoretical study of dynamic benchmarks. We contribute a versatile formal model of dynamic benchmarks that serves as the basis for our investigation. We start with a fundamental question:

Question 1: Can we design dynamic benchmarks in such a way that models continue to improve as the number of rounds of data collection grows?

We start from a theoretical model capturing existing implementations of dynamic benchmarking. This model proceeds in multiple rounds, interweaving data collection and model building sequentially. In round t, model builders face a distribution D_t and are tasked with finding a classifier h_t that performs well on D_t. We assume that model fitting succeeds in minimizing risk up to a positive classification error ϵ > 0. We further assume that annotators succeed in identifying the failure cases of the current model h_t, giving us access to the uniform distribution D̄_t over the error cases of the model h_t. We determine the new distribution D_{t+1} by mixing D_t and D̄_t in some proportion.

We assume a starting distribution D_0 on the instances of interest. We can think of D_0 as the distribution corresponding to standard data collection. Mirroring the motivation for dynamic benchmarking, this distribution might assign little to no weight to important families of instances. In particular, an error set of measure ϵ, which we assume we can achieve from the get-go, might contain many relevant instances. The goal is therefore to converge to well below the ϵ-error level guaranteed by the above assumption. We assume the distribution admits a perfect classifier so that the process could, in principle, converge to 0 error.

In this setting, we show that three rounds are guaranteed to converge to O(ϵ^2) error. Unfortunately, this is where it ends. In general, there is no reason to expect this dynamic benchmark to progress below Ω(ϵ^2) error. Put differently, there is no provable benefit to dynamic data collection beyond three rounds.
The cause of our negative result is a form of catastrophic forgetting that mirrors the concerns quoted earlier. As the benchmark moves beyond three rounds, there is provably no way to retain knowledge of instances correctly classified at earlier stages. Furthermore, we show through experiments that this lower bound may also be encountered in practice, preventing dynamic benchmarks from progressing beyond a small number of rounds. In doing so, we propose a concrete way to simulate the performance of a dynamic benchmark that may be of independent interest in the empirical study of dynamic benchmarks.

There is yet another impediment to successful dynamic benchmarking. Above we considered the case where the underlying learning problem is realizable, meaning that there exists a model that achieves 0 error on the distribution. In practice, unrealizable settings where we have label noise are commonplace. Unrealizability can result from, for instance, annotator disagreement, for which there is an emerging line of work aiming to understand its impact on data diversity, label noise, and model performance. We show that in this unrealizable setting, dynamic benchmarks concentrate on mislabeled instances, losing their representativeness of the underlying distribution.

Though pessimistic, the above negative results may be inherent to the simple sequential design of dynamic benchmarks currently used in practice. To further probe this issue, we ask:

Question 2: Are there more sophisticated dynamic benchmark designs that can guarantee convergence below the error barrier of the standard setting?

We answer this question in the affirmative by considering a hierarchical model, which recursively uses the above sequential setting as a building block. In this setting, the organizer of a benchmark creates multiple instances of dynamic data collection and combines the outcomes in a particular way, e.g., by ensembling the resulting models and feeding the output into a new instance.
We study the setting where the hierarchy has depth two and show that it guarantees convergence to error O(ϵ^3), providing a strict separation with the standard model. Despite the improved performance, this depth-two setting significantly complicates the benchmarking process, and executing a design with further depth may be prohibitive in practice.

In search of alternative designs that can consistently improve a model's performance, we study a complementary design to dynamic benchmarks in Section A of the Appendix. Instead of accumulating adversarial examples in a dynamic benchmark, here the model-in-the-loop carries the information by directly using new examples and becoming more complex throughout the process. We also make a natural connection to boosting methods. Despite achieving zero risk theoretically, this alternative has limited applicability due to either slow convergence or computational infeasibility.

In sum, our results indicate that the bottlenecks currently observed in empirical settings under the sequential model are inherent to the set-up. This can alert practitioners to the limitations of current practice before many rounds of data are collected. Further, more complex designs such as the hierarchical setting can result in improved performance but may suffer from the organizational complexity of data collection. Combined, these results highlight stark tradeoffs in switching from static to dynamic settings and suggest that exploration of the design space for modeling dynamic benchmarks can play an important role.

1.2. RELATED WORKS

For an introduction to datasets as benchmarks, see Chapter 8 in Hardt & Recht (2022). Concerns around static benchmarks are summarized in recent works, including adaptivity (Dwork et al., 2015), violation of sample independence in sequentially generated data (Shirali, 2022), and issues of annotator disagreement (Pavlick & Kwiatkowski, 2019; Prabhakaran et al., 2021; Davani et al., 2022). Numerous new benchmarks and benchmarking systems have recently been proposed that integrate some aspects of dynamic data collection. Adversarial data collection continually adds challenging examples found by annotators for an existing model (Dinan et al., 2019; Nie et al., 2020; Kiela et al., 2021; Potts et al., 2021; Wallace et al., 2022). Empirical studies show this does not necessarily lead to better performance or robustness (Kaushik et al., 2021; Wallace et al., 2022). A dynamic leaderboard periodically renews the test set (Zellers et al., 2021). In response to the fast growth of dynamic benchmarking, various tools and platforms have also been developed: for example, platforms for adversarial data collection (Kiela et al., 2021), assistance for annotators in finding challenging examples (Bartolo et al., 2021), personalized benchmarking (Narayan et al., 2021), and automatic crowdsourcing of leaderboard submissions (Khashabi et al., 2021). Adversarial filtering, which filters out examples from a static dataset that are identified to be easy for a given model, is another related technique (Paperno et al., 2016; Zellers et al., 2018; Le Bras et al., 2020). Such datasets are susceptible to bias (Phang et al., 2022) or saturate faster than static datasets (Taori et al., 2020). Le Bras et al. (2020) include theoretical considerations on eliminating bias. The flurry of newly minted benchmarks stands in stark contrast with the scarcity of theory on the topic. Our work was inspired by the thought-provoking discussion of dynamic benchmarks by Bowman & Dahl (2021).

2. PROBLEM FORMULATION

Our primary goal in this work is to understand the population-level dynamics that a benchmark design induces. We are centrally interested in capturing what the iterative process of model building and data collection converges to. Consequently, we ignore finite sample issues in our formulation and focus on distributions rather than samples. We assume that model builders successfully minimize risk approximately. While risk minimization may be computationally hard in the worst case, this assumption reflects the empirical reality that machine learning practitioners seem to be able to make consistent progress on fixed benchmarks. While we focus on population-level dynamics in the design of benchmarks, we restrict ourselves to operations with standard finite sample counterparts.

A dynamic benchmark design can be represented as a directed acyclic graph where the nodes and edges correspond to classifiers, distributions, and the operations defined below:

1. Model building: Given a distribution, find an approximate risk minimizer.
2. Data collection: Given a model, find a new distribution.
3. Model combination: Combine a set of models into a single model.
4. Data combination: Combine a set of distributions into a single distribution.

We describe these operations in turn. First, we explain model building by defining the notion of risk and risk minimization. The risk of a classifier h : X → Y on a distribution P supported on the data universe X × Y with respect to the zero-one loss is defined as

R_P(h) = E_{(x,y)∼P}[1{h(x) ≠ y}].

An ϵ-approximate risk minimizer A is an algorithm that takes a distribution P as input and returns a classifier h : X → Y such that

R_P(h) ≤ min_{h∈H} R_P(h) + ϵ,

where H is a family of classifiers. In the benchmark setting, risk minimization is a collective effort of numerous participants. The algorithm A represents these collective model-building efforts.
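On a finite domain, these two definitions can be sketched directly. The array-based representation below is an illustrative assumption for exposition, not part of the formal model: rows of H are candidate hypotheses, and a distribution is a probability vector over the points of the domain.

```python
import numpy as np

def risk(h, y, p):
    """Zero-one risk R_P(h) = E_{(x,y)~P}[1{h(x) != y}] on a finite domain.

    h, y: arrays of predicted and true labels; p: probability vector over points.
    """
    return float(p @ (h != y))

def approx_risk_minimizer(H, y, p, eps, rng):
    """An eps-approximate risk minimizer over a finite class H (rows are
    hypotheses): returns an arbitrary hypothesis within eps of the best risk."""
    risks = (H != y) @ p
    candidates = np.flatnonzero(risks <= risks.min() + eps)
    return H[rng.choice(candidates)]
```

For example, when H contains a zero-risk hypothesis, calling `approx_risk_minimizer` with eps = 0 recovers it; with eps > 0, any hypothesis of risk at most eps above the minimum may be returned, which is exactly the slack the later results exploit.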
We abstract data collection as an operation that takes a classifier h and, for an underlying distribution D, returns a new distribution D̄. In the context of adversarial data collection, we assume that annotators can find the conditional distribution over instances on which h errs, D̄ = D|_{h(x)≠y}. This is an idealized representation of data collection that ignores numerous real-world issues. Nonetheless, this idealized assumption makes our negative results even stronger. Model combination happens via a weighted majority vote among multiple classifiers. Distribution combination takes a mixture of multiple given distributions according to some proportions.

The final piece in our problem formulation is the ultimate success criterion for a benchmark design. One natural goal of a dynamic benchmark is to output a classifier that minimizes risk on a fixed underlying distribution D. We envision that this distribution represents all instances of interest. There is a subtle aspect to this choice of a success criterion. If we already assume we have an ϵ-approximate risk minimizer on D, why are we not done from the get-go? The reason is that the ϵ-error term might cover instances crucially important for the success of a classifier in real tasks. After all, a 95% accurate classifier can still perform poorly in practice if real-world instances concentrate on a set of small measure in D. The goal is therefore to find classifiers achieving risk significantly below the error level guaranteed by assumption. By asking for successively higher accuracy, we can ensure that the benchmark continues to elicit better-performing models as time goes on.

Notation. We define the error set of a classifier h as E_h = {(x, y) ∈ X × Y | h(x) ≠ y}. Note that R_P(h) = Pr_P(E_h). We drop P from Pr_P(•) when it is clear from the context. We say a classification problem is realizable on distribution P if there is a classifier f ∈ H such that R_P(f) = 0.
Here, H is the hypothesis class and f is called the true classifier. For realizable problems, no uncertainty is left over Y once x ∈ X is drawn, so we use P to refer to the distribution over X and define error sets as subsets of X. Let p_P be the probability density function associated with P. We say P is conditioned on E ⊆ X and denote it by P|_E if for any x ∈ X we have p_{P|E}(x) = p_P(x | E). For notational convenience, we sometimes use P(x) in place of p_P(x). The support supp(P) of a distribution P is the largest subset of X such that P(x) > 0 for all x ∈ supp(P). Given probability distributions P_1, P_2, ..., P_T, we denote the mixture distribution with weights w_t ≥ 0 such that Σ_t w_t = 1 by mix(P_1, P_2, ..., P_T), where p_mix(x) = Σ_{t=1}^T w_t P_t(x). For a set of classifiers h_1, h_2, ..., h_T, we denote the weighted majority vote classifier by maj(h_1, h_2, ..., h_T).
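The three primitives (conditioning, mixing, and majority voting) admit direct finite-domain sketches. The array representation below is an illustrative assumption rather than part of the formal model: distributions are probability vectors, classifiers are 0/1 prediction vectors.

```python
import numpy as np

def condition(p, event):
    """P|_E: restrict probability vector p to a boolean event and renormalize."""
    q = np.where(event, p, 0.0)
    return q / q.sum()

def mix(dists, weights=None):
    """mix(P_1, ..., P_T): weighted mixture of probability vectors (rows)."""
    dists = np.asarray(dists, dtype=float)
    if weights is None:
        weights = np.full(len(dists), 1.0 / len(dists))
    return np.asarray(weights, dtype=float) @ dists

def maj(classifiers, weights=None):
    """maj(h_1, ..., h_T): weighted majority vote of 0/1 prediction vectors."""
    classifiers = np.asarray(classifiers)
    if weights is None:
        weights = np.ones(len(classifiers))
    weights = np.asarray(weights, dtype=float)
    return (weights @ classifiers > weights.sum() / 2).astype(int)
```

For instance, `mix([[1, 0], [0, 1]])` with uniform weights yields `[0.5, 0.5]`, and the majority vote of three classifiers predicts 1 on a point exactly when a weighted strict majority of them does.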

3. PATH DYNAMIC BENCHMARKS

The simplest case of a dynamic benchmark corresponds to the design of a directed path interleaving model building and data collection, as illustrated in Figure 1. This is the design most similar to current proposals of dynamic benchmarks and adversarial data collection (Nie et al., 2020; Kiela et al., 2021). Starting from an initial distribution, at each round, a new classifier is obtained from the latest distribution. The annotators are then asked to find the vulnerabilities of this model. This new insight will be leveraged towards updating the latest distribution. We call this procedure path dynamic benchmarking.

Path dynamic benchmarking: For an underlying distribution D with true classifier f, given initial distribution D_0 and an approximate risk minimizer A, at each round t:
1. h_t = A(D_t)
2. D̄_t = D|_{h_t(x) ≠ f(x)}
3. D_{t+1} = mix(D_0, D̄_0, D̄_1, D̄_2, ..., D̄_t)

We first formalize the rationale behind path dynamic benchmarks. Ideally, given a perfect, i.e., 0-approximate, risk minimizer, every time the current classifier misclassifies some part of the underlying distribution, annotators reveal that part and the updated classifier will avoid repeated mistakes. Since errors will not be repeated across the sequence, there can be a limited number of very bad classifiers. The following simple lemma formalizes this intuition for a target error level α > 0.

Lemma 3.1. For any hypothesis class H, true classifier f ∈ H, perfect risk minimizer A, underlying distribution D, and initial distribution D_0 such that supp(D_0) ⊆ supp(D), let (h_t)_{t=0}^{T-1} be any sequence of classifiers obtained in a path dynamic benchmark with equally weighted mix(•). Then, for any α > 0, there are at most 1/α classifiers of risk more than α. In other words, |{t < T | R_D(h_t) > α}| ≤ 1/α.

See proof on page 15.
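At the population level, the three steps above can be simulated over a finite domain. The sketch below is a minimal illustration under assumed array representations (rows of H are candidate classifiers, distributions are probability vectors), not the paper's implementation.

```python
import numpy as np

def path_dynamic_benchmark(H, f, D, D0, eps, rounds, rng):
    """Population-level sketch of path dynamic benchmarking on a finite domain.

    H: candidate classifiers (rows of 0/1 predictions), f: true labels,
    D, D0: probability vectors. Returns the sequence of classifiers (h_t).
    """
    hs, err_dists = [], []
    Dt = np.array(D0, dtype=float)
    for t in range(rounds):
        # 1. model building: eps-approximate risk minimization on D_t
        risks = (H != f) @ Dt
        ht = H[rng.choice(np.flatnonzero(risks <= risks.min() + eps))]
        hs.append(ht)
        # 2. data collection: D restricted to the error set of h_t
        mistakes = (ht != f) & (D > 0)
        if not mistakes.any():          # zero-risk classifier found
            break
        bar_Dt = np.where(mistakes, D, 0.0)
        err_dists.append(bar_Dt / bar_Dt.sum())
        # 3. data combination: equally weighted mixture of D0 and all bar_D_t
        Dt = np.mean([D0] + err_dists, axis=0)
    return hs
```

With a perfect risk minimizer (eps = 0) and a realizable class, the loop terminates as soon as a zero-risk classifier is found, in line with the rationale formalized in Lemma 3.1; the interesting behavior arises when eps > 0.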
The lemma does not guarantee the latest classifier's risk, but it is straightforward to see that a random selection from the classifiers after many rounds is accurate with high probability (see Corollary C.1). A more effective way to construct an accurate classifier from the sequence of classifiers is to take their majority vote. In this case, three rounds of model building suffice to find a perfect classifier.

Proposition 3.2. Under the conditions of Lemma 3.1, let (h_t)_{t=0}^{T-1} be any sequence of classifiers obtained in a path dynamic benchmark with uniform mixture weights. If T ≥ 3, R_D(maj(h_0, h_1, ..., h_{T-1})) = 0.

Proof. From the proof of Lemma 3.1 we know E_t ∩ supp(D_t) = ∅. So, E_t ∩ E_{t'} = ∅ for every t' < t. The majority vote of the h_t's misclassifies x only if half or more of the h_t's misclassify x. But no two distinct h_t's make a common mistake. So, for three or more classifiers, the majority vote classifier is always correct.

So far, path dynamic benchmarking seems to be a promising choice when a perfect risk minimizer is available and the problem is realizable. The situation changes significantly when we move to approximate risk minimizers. We first study a three-round path dynamic benchmark. We then show how the results generalize to an arbitrary number of rounds. Our results apply to the case where the initial distribution D_0 need not be identical to the target distribution D. To measure the distance between distributions with respect to a hypothesis class, we use the following notion.

Definition 3.3 (See Ben-David et al. (2010)). For a hypothesis class H and distributions P_1 and P_2, the H∆H-distance between P_1 and P_2 is defined as

d_{H∆H}(P_1, P_2) = sup_{h,h'∈H} | E_{x∼P_1}[1{h(x) ≠ h'(x)}] − E_{x∼P_2}[1{h(x) ≠ h'(x)}] |. (1)

The next theorem discusses how path dynamic benchmarking with three rounds performs in the case of an ϵ-approximate risk minimizer.

Theorem 3.4.
Published as a conference paper at ICLR 2023

For any hypothesis class H, true classifier f ∈ H, underlying distribution D, initial distribution D_0 with supp(D_0) ⊆ supp(D), and any ϵ-approximate risk minimizer A, let h_0, h_1, and h_2 be the three classifiers obtained after three model building rounds in a path dynamic benchmark with uniform mixture weights. Then, the risk of the majority vote classifier is bounded by

R_D(maj(h_0, h_1, h_2)) ≤ O(ϵ^2 + ϵ d_{H∆H}(D_0, D)). (2)

Note that for sufficiently similar D_0 and D, i.e., d_{H∆H}(D_0, D) = O(ϵ), the risk is bounded by O(ϵ^2). See proof on page 15.

The obtained O(ϵ^2) error with only three rounds of model building is a significant improvement over the O(ϵ) error that could be achieved with static benchmarks and an ϵ-approximate risk minimizer. We next consider what happens if we continue dynamic benchmarking for many rounds.

Theorem 3.5. For any ϵ-approximate risk minimizer A with 1/ϵ ∈ ℕ, hypothesis class H with VCdim(H) ≥ 8/ϵ^2, and any path dynamic benchmark with L ≥ 3 rounds of model building and arbitrary mixture weights, there exists an underlying distribution D such that for any true classifier f ∈ H and initial distribution D_0 with supp(D_0) ⊆ supp(D), there exists a sequence (h_t)_{t=0}^{L-1} of classifiers consistent with the path dynamic benchmark where the risk of their weighted majority vote is lower-bounded by

R_D(maj(h_0, h_1, ..., h_{L-1})) ≥ ϵ^2 / 8 (3)

for any weighting of maj(•). Further, Theorem C.2 shows that for any path dynamic benchmark, there exists H with constant VC dimension such that a similar lower bound holds. See proof on page 16.

Theorem 3.5 shows that Ω(ϵ^2) error serves as a lower bound in the approximate risk minimizer setting for any path design (any mixture weighting and weighted majority). Theorem 3.4 then shows that three rounds of model building with uniform weights achieve this lower bound, so three rounds are optimal, and continuing dynamic benchmarking for more rounds might not be helpful.

3.1. CHALLENGES IN NON-REALIZABLE SETTINGS

For many reasons, our problem may not be realizable. For example, a class of functions that is not complex enough to explain the true model constitutes an unrealizable setting. Even for a sufficiently complex class, annotators might fail to label instances correctly. In a simplified model of unrealizable problems, we assign random labels to a small part of the distribution and let the rest of the distribution be realizable. Formally, let X_δ be the randomly labeled subset of X and X \ X_δ be the rest of the domain, labeled by f. Since no classifier can do well on X_δ, dynamic benchmarks are prone to overrepresent X_δ. This intuition is formalized in Theorem 3.6, where we show a significant portion of D_t will be concentrated on X_δ.

Theorem 3.6. For any hypothesis class H, true classifier f, ϵ-approximate risk minimizer A, and any underlying distribution D such that a δ-proportion of D is labeled randomly and the rest is labeled by f, if δ > ϵ, then as long as t = O(δ/ϵ), at least an Ω(1)-proportion of D_t obtained through a path dynamic benchmark will be concentrated on the unrealizable instances, i.e., Pr_{D_t}(x ∈ X_δ) = Ω(1).

See proof on page 18.

As a direct consequence, classifiers trained on a path dynamic benchmark lose their sensitivity to realizable instances and might show unexpectedly poor performance on the rest of the distribution.
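The mechanism behind Theorem 3.6 can be illustrated with a one-step simulation. This is a hedged sketch with assumed parameters: a uniform underlying distribution and a classifier that is perfect on the realizable part, so it errs only inside the coin-flip region.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 1000, 0.1
noisy = np.arange(n) < int(delta * n)   # X_delta: the randomly labeled region
D = np.full(n, 1.0 / n)                 # underlying distribution (uniform)

# Even a classifier that is perfect on the realizable part errs on roughly
# half of X_delta, since labels there are coin flips.
errors = noisy & (rng.random(n) < 0.5)
bar_D = np.where(errors, D, 0.0)
bar_D /= bar_D.sum()                    # D restricted to the error set

D1 = 0.5 * D + 0.5 * bar_D              # one equally weighted mixing step
mass_on_noisy = D1[noisy].sum()         # jumps from delta = 0.1 to 0.55
```

Since bar_D is supported entirely on X_δ, each further round mixes in another error distribution concentrated there, so the benchmark's mass keeps drifting toward the unrealizable region, as the theorem states.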

4. HIERARCHICAL DYNAMIC BENCHMARKS

Thus far, we have observed that given an ϵ-approximate risk minimizer, the smallest achievable risk through path dynamic benchmarks is Ω(ϵ^2). This observation evokes the idea of possibly achieving a squared risk by adding a new layer to the benchmarking routine, which calls path dynamic benchmarking as a subroutine. Figure 2 shows such a structure with three steps. At each step, path dynamic benchmarks are obtained starting from a mixture of the initial distribution and the error distributions of all previous steps. In the second layer, models found in different steps are aggregated.

Hierarchical dynamic benchmarking: For an underlying distribution D and true classifier f, given an initial distribution D_0 and an approximate risk minimizer A, depth-k width-w hierarchical dynamic benchmarks are constructed recursively:

def. A^(k)(D_0):
1. h_0 = A^(k-1)(D_0)
2. For t = 1, ..., w - 1:
   (a) D̄_{t-1} = D|_{h_{t-1}(x) ≠ f(x)}
   (b) D_t = mix(D_0, D̄_0, D̄_1, ..., D̄_{t-1})
   (c) h_t = A^(k-1)(D_t)
3. return maj(h_0, h_1, ..., h_{w-1})

where A^(0) = A.

Note that this setting does not mean that different steps of path dynamic benchmarking can be done in isolation. Running two benchmarking tasks independently does not yield benefits, since the humans in the loop might return similar vulnerabilities. In the depth-2 hierarchy of Figure 2, bold-faced arrows correspond to the sequence of steps that should be taken. So, any reasonable dynamic benchmarking based on our model is a sequential process, and the name "hierarchical" is meant to convey how the aggregation of classifiers happens. Also note that A^(k) calls annotators for w^k rounds. Since no known empirical dynamic benchmarking study has run for more than 20 rounds (as studied in Wallace et al. (2022)), we limit our analysis to a depth-2 width-3 structure (Figure 2).
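The recursion A^(k) above can be sketched at the population level over a finite domain. The array representations and the argmin-based stand-in for A in the usage below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def hierarchical(A, k, w, D0, D, f):
    """Depth-k, width-w hierarchical dynamic benchmark A^(k) as a recursive
    population-level sketch. A maps a probability vector to 0/1 predictions;
    D0, D are probability vectors over a finite domain; f holds true labels."""
    if k == 0:
        return A(D0)
    hs = [hierarchical(A, k - 1, w, D0, D, f)]      # h_0 = A^(k-1)(D_0)
    err_dists, Dt = [], D0
    for t in range(1, w):
        mistakes = (hs[-1] != f) & (D > 0)          # error set of h_{t-1}
        if not mistakes.any():                      # zero-risk classifier found
            break
        bar_D = np.where(mistakes, D, 0.0)
        err_dists.append(bar_D / bar_D.sum())
        Dt = np.mean([D0] + err_dists, axis=0)      # uniform mix
        hs.append(hierarchical(A, k - 1, w, Dt, D, f))
    return (2 * np.sum(hs, axis=0) > len(hs)).astype(int)  # majority vote
```

With a perfect base minimizer, A^(2) already returns a zero-risk classifier; the regime of interest in Theorems 4.1 and 4.2 is an ϵ-approximate A, where each inner call may exploit its slack adversarially.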
Here, w ≥ 3 is the necessary and sufficient number of rounds for path dynamic benchmarks to ensure O(ϵ^2) risk when A is an ϵ-approximate risk minimizer and D_0 is sufficiently similar to D. The next theorem provides an upper bound on the risk of any classifier obtained through the hierarchical dynamic benchmark of Figure 2.

Theorem 4.1. For any hypothesis class H, true classifier f ∈ H, underlying distribution D, initial distribution D_0 with supp(D_0) ⊆ supp(D), and any ϵ-approximate risk minimizer A, the risk of any classifier obtained from a hierarchical dynamic benchmark with depth 2 and width 3 (Figure 2), where mix(•) and maj(•) uniformly weight their inputs, is bounded by

R_D(maj(g_0, g_1, g_2)) ≤ O(ϵ^3 + ϵ^2 d_{H∆H}(D_0, D)). (4)

Note that for sufficiently similar D_0 and D, i.e., d_{H∆H}(D_0, D) = O(ϵ), the risk is bounded by O(ϵ^3). See proof on page 19.

The hope in adding a new layer to dynamic benchmarking was to square the previous layer's risk. So, ideally, a depth-2 hierarchical dynamic benchmark could achieve O(ϵ^4) error. But this is not the case; next, we show that the upper bound of Theorem 4.1 is tight up to a constant, and the conjecture of squared risk per layer does not hold.

Theorem 4.2. For any ϵ-approximate risk minimizer A with 1/ϵ ∈ ℕ, hypothesis class H with VCdim(H) ≥ 2/ϵ^3, and any hierarchical dynamic benchmark with depth 2 and width 3 (Figure 2) with arbitrary mixture and majority weights, there exists an underlying distribution D such that for any true classifier f ∈ H and initial distribution D_0 with supp(D_0) ⊆ supp(D), there exist classifiers consistent with the hierarchical dynamic benchmark for which

R_D(maj(g_0, g_1, g_2)) ≥ ϵ^3 / 2. (5)

Further, Theorem C.3 shows that for any hierarchical dynamic benchmark, there exists H with constant VC dimension such that a similar lower bound holds. See proof on page 20.
Combined, these results show that design structures that are more intricate than the current practice, as captured by the path dynamic benchmark, can yield improved results but also have strong limitations and are challenging to implement in practice.

5. EXPERIMENTS

Theorem 3.5 provides an Ω(ϵ^2) lower bound on the risk achievable with an ϵ-approximate risk minimizer. The proof of this theorem is constructive, introducing a bad sequence of distributions and classifiers consistent with the path design with Θ(ϵ^2) error. But the theorem does not rule out the existence of a good sequence achieving arbitrarily small risk. An important question is how frequently bad sequences appear in practice, keeping the error above zero even after many rounds. We study this question by simulating path dynamic benchmarks on two popular static benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and SNLI (Bowman et al., 2015). The details of the data and models are reported in Section B.1 of the Appendix. Our aim in these experiments is not to obtain a state-of-the-art model. Instead, we want to study the effectiveness of path dynamic benchmarks in a controlled experimental setting with light models. Our simulation design is similar for both datasets:

1. Train a base classifier on the whole dataset and fix it for the next steps.
2. Construct a new dataset from samples correctly classified by the base model and define the true and initial distributions as a uniform (point mass) distribution over these samples. Note that in this case empirical risk on samples weighted according to a point mass distribution is equivalent to risk on that distribution. Since the base model correctly labels all selected samples, the problem is also realizable.
3. Draw multiple rollouts of path dynamic benchmarks. A rollout from a path dynamic benchmark is a sequence of distributions and models obtained by alternately training a new classifier on the weighted extracted dataset and up-weighting (down-weighting) the distribution over misclassified (correctly classified) samples according to a mixture rule with uniform weights.
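A single rollout of this simulation can be sketched as follows. Here `train` and `predict` are hypothetical stand-ins for the model fitting and inference routines, and the sample weights play the role of the point-mass distribution D_t.

```python
import numpy as np

def rollout(train, predict, X, y, rounds, rng):
    """One rollout of the simulated path dynamic benchmark (a sketch).

    train(X, y, weights, rng) fits a fresh model on weighted samples;
    predict(model, X) returns predicted labels."""
    n = len(y)
    D0 = np.full(n, 1.0 / n)
    models, err_dists, Dt = [], [], D0
    for t in range(rounds):
        model = train(X, y, Dt, rng)                # new classifier each round
        models.append(model)
        wrong = predict(model, X) != y
        if not wrong.any():                         # all samples classified
            break
        bar_D = np.where(wrong, D0, 0.0)            # uniform over mistakes
        err_dists.append(bar_D / bar_D.sum())
        Dt = np.mean([D0] + err_dists, axis=0)      # uniform mixture rule
    return models
```

In the actual experiments, `train` would be the (non-convex) training of a light CIFAR-10 or SNLI model from a fresh initialization, which is what makes each round an ϵ-approximate rather than exact risk minimizer.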
Note that the base model is fixed, and the randomness across rollouts comes solely from different initializations of new classifiers and possible randomness in the optimization methods. There are two deviations from our theoretical study in this design. First, we studied binary classification, but both CIFAR-10 and SNLI define multi-class problems. We believe our main arguments hold for multi-class tasks as well. Second, although the problem is realizable, the solution is unlikely to have zero risk, as training a new classifier at each round of a rollout typically involves non-convex optimization. This is addressed in our theoretical framework as ϵ-approximate risk minimization. Our analysis requires a fixed ϵ across all rounds, which approximately holds in practice.

Figure 3a shows results from 100 rollouts of path dynamic benchmarks simulated on CIFAR-10. At each round, the value on the vertical axis shows the risk of the majority vote of all the classifiers obtained in a rollout so far. The solid line is the average of the majority vote's risk over all rollouts, and the shaded area shows the standard deviation. There are a few observations. First, although zero risk is attainable, path dynamic benchmarks fail to reach it. Second, the variance across rollouts is quite consistent at each round. This suggests that a good rollout stays good, and likewise a bad rollout stays bad, so continuing dynamic benchmarking does not help. We further discuss why a rollout turns out to be good or bad, and how this relates to our theoretical negative example, in Section B.2 of the Appendix. Figure 3b shows the results from 50 rollouts of path dynamic benchmarks simulated on SNLI. A similar observation regarding non-zero risk in the limit can be made here. Compared to the image classification task, path dynamic benchmarks in the NLI task show more fluctuation across rounds, which might be due to the harder nature of the problem or a more complex model.
Summarizing these observations, path dynamic benchmarks, no matter how long we run them for, confront a lower bound even in simple realizable settings. This provides empirical evidence that our negative results are not contrived, but point at what may be inherent limitations to dynamic benchmarking.

6. DISCUSSION

The scientific foundation of traditional benchmarks is the holdout method, whereby we split the dataset into a training and a testing component. Although the machine learning community routinely operates well outside the statistical guarantees of the holdout method, at least there is some cognizable theoretical framework for static datasets as benchmarks. Moreover, more recent works have found new support for static benchmarks (Blum & Hardt, 2015; Recht et al., 2019). Dynamic benchmarks provide an intriguing proposal aimed at mitigating known issues with static benchmarks. Platforms for dynamic benchmarks are already up and running. However, machine learning theory has little to offer in the way of guiding the principled design of valid dynamic benchmarks. Responding to this lack of theoretical foundations, we propose a formal model of dynamic benchmarks that allows us to combine model building and data collection steps in a flexible manner. We focus on what we think is the first-order concern in the design of a benchmark: Does the benchmark, in principle, induce the creation of better models? While our results show a provable benefit to dynamic benchmarking, it is arguably more subtle than hoped for and comes at a significant increase in complexity. Our negative results are particularly concerning given that we optimistically assume that both model builders and annotators have significant competence. In practice, dynamic benchmarks likely face additional obstacles neglected by our idealized assumptions. As a result, it is less clear to what extent our positive results have prescriptive value. On the other hand, our positive results make it clear that the design space for dynamic benchmarks is larger than currently utilized, thus pointing at a range of interesting open problems.
Finally, while lowering the risk, or equivalently improving the accuracy, of the induced models is the ultimate success criterion of a benchmark, there are other concerns, such as the robustness of the induced models, that future work can study on top of our proposed framework.

A DYNAMIC GRADIENT-BASED UPDATE

The current practice of dynamic benchmarking aims at diversifying the benchmark by dynamically adding new hard examples to it, in the hope of obtaining better models from the benchmark every time new samples are added. As we observed, despite the initial benefit of this method, there exist arbitrarily long dynamic benchmarks with no significant improvement after the first few steps. But if the ultimate goal is to obtain improved models, there is a more direct way to achieve it. Instead of keeping all the adversarial feedback in the form of a dynamic benchmark, the model-in-the-loop itself can carry all the information by becoming more and more complex throughout the process. In other words, rather than updating the benchmark with adversarial examples, these examples can be collected in a way that directly updates the model. Dynamic adversarial feedback from annotators helps here by providing fresh examples at each round that prevent overfitting. This approach is closely related to the boosting technique in machine learning. In this section, we discuss how directly updating the model using adversarial data can be formulated within our framework and why the resulting methods are not practical. Since we use gradient-based techniques to update the model, we call these methods dynamic gradient-based updating. We first introduce a vector notation for functions and distributions, which makes our arguments easier to follow. Then we discuss how a classifier's risk with respect to the zero-one loss is represented in this notation. In search of a classifier that minimizes this risk, we minimize surrogate convex losses rather than directly optimizing the zero-one risk. We discuss two popular choices, namely the hinge and exponential losses, and for each we present the corresponding method with its strengths and limitations. Notation. Let h : X → {-1, 1} be a binary classifier defined on the finite set X. The vector representation of h in X is $h = (h(x))_{x \in X}$.
Similarly, let P be a probability distribution over X. The vector representation of P in X is $P = (P(x))_{x \in X}$. The entrywise product of two vectors $h_1$ and $h_2$ is denoted by $h_1 \circ h_2$. For an underlying distribution D and true classifier f, we can use vector representations to write the risk with respect to the zero-one loss. The risk of a binary classifier h on D is $R_D(h) = \frac{1}{2}\langle \mathbf{1} - h \circ f, D \rangle$. For a general h : X → R, we can still define the risk with respect to the zero-one loss as $R_D(h) = \frac{1}{2}\langle \mathbf{1} - \operatorname{sign}(h \circ f), D \rangle$, where $\operatorname{sign}(\cdot)$ is applied entrywise. Upper-bounded risk. There are many ways to upper-bound $R_D(h)$. For example, for any entrywise function $l(\cdot)$ such that $l(x) \ge \mathbb{1}\{x \le 0\} = \frac{1}{2} - \frac{1}{2}\operatorname{sign}(x)$ for all x ∈ R, the risk of h with respect to the zero-one loss can be upper-bounded by $R_D(h) \le R^l_D(h) = \langle l(h \circ f), D \rangle$.
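The vector notation can be made concrete with a short numerical sketch (our own illustration, not part of the analysis; the domain size and classifiers below are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X_size = 8                                    # a tiny finite domain X
f = np.sign(rng.standard_normal(X_size))      # vector representation of the true classifier
h = f.copy()
h[:2] *= -1                                   # h disagrees with f on exactly 2 points
D = np.full(X_size, 1 / X_size)               # uniform distribution over X

# Zero-one risk: R_D(h) = 1/2 <1 - h∘f, D>
risk = 0.5 * np.dot(1 - h * f, D)
print(risk)                                   # -> 0.25 (2 of 8 points misclassified)
```

The inner product recovers exactly the probability mass of the disagreement set, matching the definition above.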

A.1 MINIMIZING HINGE LOSS

A popular function to upper-bound the zero-one risk is the hinge loss: $l(x) = \max(1 - x, 0)$. Plugging $l(\cdot)$ into Equation 6 gives $R_D(h) \le R^{\mathrm{hinge}}_D(h) = \langle \max(\mathbf{1} - h \circ f, \mathbf{0}), D \rangle$, where $\max(\cdot)$ is applied elementwise. Let $g = \nabla_h R^{\mathrm{hinge}}_D$. Looking for an update of the form h := h + ∆h to reduce $R^{\mathrm{hinge}}_D(h)$, any direction such that ⟨g, ∆h⟩ < 0 is a descent direction, and a small step size guarantees a consistent decrease of $R^{\mathrm{hinge}}_D$. As we show in the proof of Lemma A.1, directly applying gradient descent, i.e., ∆h = -ηg, is not practical, as it involves the sum of a distribution vector and a classifier vector. Unlike classifiers, which are defined at every point of the domain, in practice distributions are accessible only through the available samples, and this summation is not implementable. Alternatively, let $E_h = \{x \in X \mid h(x)f(x) < 1\}$ be the set of margin errors of classifier h. We task annotators with returning $D_h = D|_{E_h}$ given h. Let $\hat h = A(D_h)$ be the model built on the vulnerabilities of h. The next lemma shows $\hat h$ is a descent direction for the hinge loss. Lemma A.1. For any hypothesis class H, true classifier f ∈ H, current classifier h ∈ H, ϵ-approximate risk minimizer A, and any underlying distribution D, the vector representation $\hat h$ of the classifier $\hat h = A(D|_{h(x)f(x) < 1})$ is a descent direction for $R^{\mathrm{hinge}}_D(h)$.

See proof on page 22.

This lemma lets us write the update rule $h := h + \eta\, R^{\mathrm{hinge}}_D(h)\, \hat h$, depicted graphically in Figure 4. Since the gradient dominance condition holds for this update and the hinge loss is 1-Lipschitz, h converges to f at a rate of $O(|X|/t^2)$. Although this method guarantees convergence, the dependence on the domain size makes the bound useless for continuous or very large domains.
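The update of Lemma A.1 can be simulated directly. The following is a toy sketch under our own simplifying assumptions: a tiny finite domain, uniform D, and an exact (0-approximate) risk minimizer A over the class of all sign vectors, so that A simply recovers f from the error distribution:

```python
import numpy as np
from itertools import product

d = 6
D = np.full(d, 1 / d)                               # uniform underlying distribution
H = np.array(list(product([-1.0, 1.0], repeat=d)))  # tiny hypothesis class: all sign vectors
f = H[5]                                            # some ground-truth classifier in H

def hinge_risk(h):
    return np.dot(np.maximum(1.0 - h * f, 0.0), D)

def A(P):
    # exact risk minimizer over H w.r.t. the zero-one risk under distribution P
    zo = 0.5 * (1.0 - H @ (f * P))
    return H[np.argmin(zo)]

h = -f.copy()                                       # start from the worst classifier
eta = 0.5
risks = [hinge_risk(h)]
for _ in range(20):
    E = h * f < 1                                   # margin-error set E_h
    if not E.any():
        break
    P = np.where(E, D, 0.0)
    P /= P.sum()                                    # D restricted to E_h, renormalized
    h_hat = A(P)                                    # model built on h's weak spots
    h = h + eta * hinge_risk(h) * h_hat             # update rule of Lemma A.1
    risks.append(hinge_risk(h))
print(risks[0], risks[-1])                          # hinge risk decreases monotonically
```

Because the exact minimizer returns f itself here, each step moves h a fraction of the way toward f and the hinge risk halves every round; with an ϵ-approximate minimizer the decrease is slower but, by Lemma A.1, still guaranteed.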

A.2 MINIMIZING EXPONENTIAL LOSS

Another candidate function to upper-bound the zero-one risk is the exponential loss: $l(x) = \exp(-x)$. This leads to an analysis similar to that of the AdaBoost algorithm (Schapire & Singer, 1999). Plugging $l(\cdot)$ into Equation 6 gives $R_D(h) \le R^{\exp}_D(h) = \langle \exp(-h \circ f), D \rangle$, where $\exp(\cdot)$ is the elementwise exponential. As with hinge-loss minimization, we show in the proof of Lemma A.2 that directly updating h with a gradient term is not implementable. Instead, we search the hypothesis class for a classifier $\hat h$ whose vector representation minimizes $\langle \hat h, g \rangle$, where $g = \nabla_h R^{\exp}_D$. The next lemma finds such a classifier along with the optimal step size. Lemma A.2. For any hypothesis class H, true classifier f ∈ H, current classifier h ∈ H, ϵ-approximate risk minimizer A, and any underlying distribution D, $\hat h = A(D_h)$ is the solution of $\min_{\hat h \in H} \langle \hat h, g \rangle$, where $D_h(x) \propto D(x)\exp(-h(x)f(x))$ and $g = \nabla_h R^{\exp}_D$. Further, $\eta = \frac{1}{2}\log\big(\frac{1}{R_{D_h}(\hat h)} - 1\big)$ is the best step size for the update rule $h := h + \eta \hat h$.

See proof on page 23.

Let $h_t$ be the final classifier obtained after t updates according to the update rule of Lemma A.2. An analysis similar to that of AdaBoost shows $R_D(h_t) \le \exp\big(-\frac{(1-2\epsilon)^2 t}{2}\big)$. Despite its exponential convergence rate, this method is not practical for two reasons. First, it is computationally hard, as reweighting a distribution requires computing a normalization factor, which is a sum over the whole domain. Second, it requires sampling from D, which might not be possible. In summary, gradient-based updates guarantee convergence of the updated classifier to the true classifier; however, they suffer from either slow convergence or computational hardness.
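The optimal step size of Lemma A.2 can be checked numerically. This is our own illustrative sketch: the current classifier h and update direction ĥ below are arbitrary toy choices, and the grid search serves only as a reference for the closed-form step size:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
D = np.full(d, 1 / d)
f = np.sign(rng.standard_normal(d))           # true labels as a sign vector
h = 0.3 * rng.standard_normal(d)              # current real-valued classifier
h_hat = f.copy()
h_hat[0] *= -1                                # candidate update direction with one mistake

def exp_risk(g):
    return np.dot(np.exp(-g * f), D)          # R^exp_D(g) = <exp(-g∘f), D>

Dh = D * np.exp(-h * f)
Dh /= Dh.sum()                                # D_h(x) ∝ D(x) exp(-h(x) f(x))
r = 0.5 * np.dot(1 - h_hat * f, Dh)           # zero-one risk of h_hat under D_h
eta = 0.5 * np.log(1 / r - 1)                 # step size claimed optimal in Lemma A.2

grid = np.linspace(-3.0, 3.0, 2001)
best = min(exp_risk(h + e * h_hat) for e in grid)
print(exp_risk(h + eta * h_hat) <= best + 1e-9)   # -> True
```

The closed form matches the grid minimum because $R^{\exp}_D(h + \eta \hat h) = Z\big((1-r)e^{-\eta} + r e^{\eta}\big)$ is convex in η with minimizer $\eta = \frac{1}{2}\log\frac{1-r}{r}$.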

B MORE ON EXPERIMENTS

We elaborate on the details of datasets, models, and further observations from the simulation of path dynamic benchmarks in this section.

B.1 DATA AND MODELS

The CIFAR-10 dataset contains 60,000 32 × 32 color images in 10 classes and is commonly used as a static benchmark for image classification. We use a shallow feed-forward Convolutional Neural Network (CNN) consisting of two convolution layers (with 32 and 64 filters), interleaved with two max-pooling layers, followed by a dropout layer and a dense output layer with 10 units, for a total of about 40k trainable parameters. We use the same architecture to train the base model and the classifiers in drawing rollouts. The base classifier achieves 73% accuracy on the training data after 30 epochs, reasonably above the chance level of about 10%. The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a popular NLI benchmark consisting of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. We restrict our simulation to a random 50k subset of this data. Our model is a slightly modified version of the model of Bowman et al. (2015): words are represented with pre-trained 100d GloVe vectors, and two parallel Bidirectional Long Short-Term Memory (BiLSTM) layers extract sentence embeddings. The concatenated embeddings are fed into three hidden dense layers (128 units each) before the final dense layer with 3 units. The model has about 120k trainable parameters, and the base model achieved an accuracy of 68% on the training data and 61% on the test data, consistent with the original study given the limited number of samples.
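The stated ~40k parameter count of the CNN can be reproduced with a back-of-the-envelope calculation. This sketch is ours: the text only specifies the filter counts (32, 64), the two max-pooling layers, and the rough total, so the 3×3 kernels, 'valid' padding, and 2×2 pools below are assumptions:

```python
# Hypothetical parameter count for the CIFAR-10 CNN described above.
def conv_params(c_in, c_out, k=3):
    return c_out * (c_in * k * k + 1)         # weights + biases, assumed 3x3 kernels

h = w = 32                                    # CIFAR-10 input resolution
h, w = h - 2, w - 2                           # conv1 (3 -> 32 channels, 'valid' padding)
h, w = h // 2, w // 2                         # assumed 2x2 max-pool
h, w = h - 2, w - 2                           # conv2 (32 -> 64 channels)
h, w = h // 2, w // 2                         # assumed 2x2 max-pool

total = (conv_params(3, 32) + conv_params(32, 64)
         + (h * w * 64 + 1) * 10)             # flatten -> dense(10); dropout adds none
print(total)                                  # -> 42442, i.e. roughly 40k
```

Under these assumptions the count lands close to the ~40k reported, which suggests the described architecture is internally consistent.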

B.2 FURTHER OBSERVATIONS

The observations in Section 5 raise another question: what makes a rollout bad? Do bad rollouts share a common characteristic? We hypothesize that the more similar a rollout is to the bad sequence constructed in the proof of Theorem 3.5, the worse its final risk. A distinctive feature of the negative example in Theorem 3.5 is that the initial classifiers in the sequence choose their error sets to overlap only on a fixed part of the distribution, which turns out to be the error set of the majority vote. We define a score that captures this behavior without being directly tied to the accuracy of the final majority vote model. Let $E_m$ be the error set of the majority vote classifier and $E_{h_t}$ be the error set of the round-t classifier. Then for every pair of distinct rounds $t_1, t_2 < T$ of a rollout, define $z_{t_1,t_2} = \frac{|E_{h_{t_1}} \cap E_{h_{t_2}} \cap E_m|}{|E_{h_{t_1}} \cap E_{h_{t_2}}|}$. We use $z_T = \frac{1}{T(T-1)} \sum_{t_1=0}^{T-1} \sum_{t_2=0, t_2 \ne t_1}^{T-1} z_{t_1,t_2}$ to quantify the similarity of a rollout to the theoretical negative example. The proof of Theorem 3.5 shows that as long as $T \le \frac{2}{\epsilon}$, the theoretical negative example gives $z_T = 1$. Figure 5a depicts the risk of the majority vote at the end of the rollout (round 25) vs. the $z_4$ score. The correlation is positive and statistically significant (Pearson r = 0.46, p < 0.001). So the more similar a rollout is to the theoretical bad example in terms of this score, the more likely it is to be a bad rollout. Interestingly, at round 4, the risk of the majority vote classifier can barely explain the risk after 25 rounds (Pearson r = 0.18, p > 0.05). This shows that our negative-example construction is not merely a worst case but may have identified an important characteristic that naturally appears in dynamic benchmarks and limits their effectiveness.
Figure 5b likewise shows a significant positive correlation (Pearson r = 0.68, p < 0.001) between the final risk of a rollout and its similarity to the theoretical negative example measured early, at round 6. At this round, the risk of the current majority vote classifier is barely informative about the risk after 20 rounds (Pearson r = -0.27, p > 0.05). Once more, the theoretically constructed negative example can explain the natural failure of dynamic benchmarking in different contexts.
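The score $z_T$ is straightforward to compute from the round-wise error sets. The error sets below are made-up toy stand-ins (not the CIFAR-10/SNLI rollouts), chosen only to exercise the definition:

```python
from itertools import permutations

E_m = {0, 1, 2}                               # error set of the majority vote
E = [{0, 1, 2, 5}, {0, 1, 2, 6},              # toy error sets for rounds 0..3
     {0, 3, 5, 9}, {1, 4, 5, 8}]

def z_pair(E1, E2, E_m):
    inter = E1 & E2
    return len(inter & E_m) / len(inter)      # assumes the intersection is nonempty

T = len(E)
z_T = sum(z_pair(E[t1], E[t2], E_m)
          for t1, t2 in permutations(range(T), 2)) / (T * (T - 1))
print(round(z_T, 3))                          # -> 0.667
```

A value near 1 means pairwise error overlaps are concentrated inside the majority vote's error set, i.e., the rollout resembles the theoretical negative example.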

C ADDITIONAL STATEMENTS AND PROOFS

Corollary C.1. Under the conditions of Lemma 3.1, let $\hat h$ be one of the $h_t$'s selected uniformly at random. For any δ > 0 and α > 0, if $T \ge \frac{1}{\delta\alpha}$, then with probability at least 1 - δ, $\hat h$ is α-accurate. Proof. From Lemma 3.1, at most $\frac{1}{\alpha}$ of the $h_t$'s are α-bad, so the probability that $\hat h$ is α-bad is at most $\frac{1}{\alpha T} \le \delta$. Proof of Lemma 3.1. As supp(D_0) ⊆ supp(D), running path dynamic benchmarking preserves supp(D_t) ⊆ supp(D) for every t. On the other hand, we know $R_D(f) = 0$, so $R_{D_t}(f) = 0$. As A is a perfect risk minimizer and f ∈ H, we must have $R_{D_t}(h_t) = 0$. Therefore, the errors of $h_t$ must happen outside of supp(D_t): $E_t \cap \mathrm{supp}(D_t) = \emptyset$, where $E_t$ is the error set of $h_t$. Now assume t is one of the indices where $h_t$ is α-bad, i.e., $\Pr_D(E_t) > \alpha$. Then $\Pr_D(x \in \mathrm{supp}(D_{t+1})) = \Pr_D(x \in \mathrm{supp}(D_t)) + \Pr_D(x \in E_t) \ge \Pr_D(x \in \mathrm{supp}(D_t)) + \alpha$. This cannot happen more than $\frac{1}{\alpha}$ times. Proof of Theorem 3.4. Starting from $D_0$ with supp(D_0) ⊆ supp(D), path dynamic benchmarks preserve supp(D_t) ⊆ supp(D) for every t. Since the problem is realizable and A is an ϵ-approximate risk minimizer, every consistent $h_t$ must satisfy $R_{D_t}(h_t) \le \epsilon$. The distribution $D_t$ itself is a mixture of the initial and error distributions: $D_t = \mathrm{mix}(D_0, \tilde D_0, \dots, \tilde D_{t-1})$. So we can expand $R_{D_t}(\cdot)$ as a linear combination of the risks under the mixture components. Note that the risk of $h_t$ under an error distribution $\tilde D_{t'}$, where t' < t, can equivalently be written as $R_{\tilde D_{t'}}(h_t) = \Pr_D(E_t \mid E_{t'})$. For t = 0, 1, 2 we have: $R_{D_0}(h_0) \le \epsilon$; $\frac{1}{2} R_{D_0}(h_1) + \frac{1}{2}\Pr_D(E_{h_1} \mid E_{h_0}) \le \epsilon$; $\frac{1}{3} R_{D_0}(h_2) + \frac{1}{3}\Pr_D(E_{h_2} \mid E_{h_0}) + \frac{1}{3}\Pr_D(E_{h_2} \mid E_{h_1}) \le \epsilon$. These constraints impose $R_{D_0}(h_t) = \Pr_{D_0}(E_t) \le (t+1)\epsilon$ and $\Pr_D(E_t \mid E_{t'}) \le (t+1)\epsilon$. However, to limit the joint error probability of two classifiers we need to bound $\Pr_D(E_t)$: $\Pr_D(E_t) = R_D(h_t) \le R_{D_0}(h_t) + d_{H\Delta H}(D_0, D)$.
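The counting argument in the proof of Lemma 3.1 can be illustrated with a small simulation. This is our own sketch: the adversarial choice of error sets below is a stand-in for a worst-case perfect risk minimizer that is always correct on the current support but errs on fresh mass just above α:

```python
import numpy as np

d, alpha = 100, 0.1
D = np.full(d, 1 / d)                            # uniform underlying distribution
support = {0}                                    # supp(D_0): a single point
bad_rounds = 0
while True:
    outside = [x for x in range(d) if x not in support]
    if not outside:
        break
    # Perfect risk minimizer: correct on supp(D_t), errs on ~alpha fresh mass.
    E_t = set(outside[: round(alpha * d) + 1])
    if D[list(E_t)].sum() > alpha:               # this round is alpha-bad
        bad_rounds += 1
    support |= E_t                               # annotators mix D|E_t into the benchmark
print(bad_rounds)                                # -> 9, consistent with at most 1/alpha
```

Each α-bad round forces the support of $D_t$ to grow by at least α probability mass, so the number of α-bad rounds is capped at $\frac{1}{\alpha}$, exactly as the lemma states.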
Finally, by a union bound, the error probability of maj(h_0, h_1, h_2) is at most the sum of all pairwise joint error probabilities. Putting this together: $R_D(\mathrm{maj}(h_0, h_1, h_2)) \le \Pr_D(E_{h_0} \cap E_{h_1}) + \Pr_D(E_{h_0} \cap E_{h_2}) + \Pr_D(E_{h_1} \cap E_{h_2}) \le 11\epsilon^2 + 8\epsilon\, d_{H\Delta H}(D_0, D)$. Proof of Theorem 3.5. For a given domain X, we define a new finite domain $X_d \in X^d$ (a set of d points in X) such that $X_d$ can be shattered by H. For an arbitrary but fixed order on $X_d$, we can define equivalent vector representations of functions and distributions in the new domain. In particular, let h : X → {-1, 1} be a binary classifier defined on X. The vector representation of h in $X_d$ is $h = (h(x))_{x \in X_d} \in \{-1, 1\}^d$. Similarly, let P be a probability distribution over X; the vector representation of P in $X_d$ is $P = (P(x))_{x \in X_d} \in \Delta(X_d)$. Throughout the proof, we use K to denote a subset of $X_d$ with k elements, and P(K) as a shorthand for $\sum_{x \in K} P(x)$. We use interval notation for discrete intervals; for example, [a, b) denotes {a, a+1, ..., b-1}. Finally, $h_1 \circ h_2$ denotes the entrywise product of $h_1$ and $h_2$. The four allowed operations on classifiers and distributions have new interpretations in the new domain: 1. $D_t \xrightarrow{A} h_t$: The fact that A is an ϵ-approximate risk minimizer imposes the constraint $\langle h_t \circ f, D_t \rangle = \sum_{h_t(x)f(x)=1} D_t(x) - \sum_{h_t(x)f(x)=-1} D_t(x) = 1 - 2R_{D_t}(h_t) \ge 1 - 2\epsilon$. (13) 2. $h_t \xrightarrow{H} \tilde D_t$: Let $K_t$ be the set of indices on which $h_t$ and f disagree. Then it is straightforward to see $\tilde D_t = \frac{1}{2D(K_t)}(D - h_t \circ f \circ D)$. (14) 3. $(h_0, h_1, \dots, h_t) \xrightarrow{\mathrm{maj}} \hat h$: This is equivalent to per-dimension voting: $\hat h(x) = \mathrm{maj}(h_0(x), h_1(x), \dots, h_t(x))$ for $x \in X_d$. 4. $(D_0, \tilde D_0, \tilde D_1, \dots, \tilde D_{t-1}) \xrightarrow{\mathrm{mix}} D_t$: This is simply the weighted sum $D_t = w_{t,0} D_0 + \sum_{t' < t} w_{t,t'} \tilde D_{t'}$, where the w's are the mixture weights, summing to 1. (15)
Published as a conference paper at ICLR 2023
Note that in all of these operations, $h_t$ and f appear only as $h_t \circ f$. Since $X_d$ can be shattered by H, without loss of generality we assume f is all ones and write $h_t$ in place of $h_t \circ f$ in the following. Next, we start from the initial distribution $D_0$ as the input to the path dynamic benchmark and derive the constraints imposed on the variables throughout the process: • $D_0 \xrightarrow{A} h_0$: In this case $\langle h_0, D_0 \rangle = 1 - 2D_0(K_0)$, and Equation 13 requires $D_0(K_0) \le \epsilon$. (16) • $D_t \xrightarrow{A} h_t$: For the $D_t$ defined in Equation 15, we have $\langle h_t, D_t \rangle = w_{t,0}\langle h_t, D_0 \rangle + \sum_{t' < t} w_{t,t'} \langle h_t, \tilde D_{t'} \rangle$, (17) where $\langle h_t, D_0 \rangle = 1 - 2D_0(K_t)$ and $\langle h_t, \tilde D_{t'} \rangle = \frac{1}{2D(K_{t'})}\big(\langle h_t, D \rangle - \langle h_t \circ h_{t'}, D \rangle\big) = \frac{1}{2D(K_{t'})}\big(1 - 2D(K_t) - 1 + 2D(K_t \cup K_{t'} - K_t \cap K_{t'})\big) = 1 - \frac{2D(K_t \cap K_{t'})}{D(K_{t'})}$. (18) Here we used $D(K_t \cup K_{t'} - K_t \cap K_{t'}) = D(K_t) + D(K_{t'}) - 2D(K_t \cap K_{t'})$. Plugging $\langle h_t, \tilde D_{t'} \rangle$ into Equation 17 and requiring $\langle h_t, D_t \rangle \ge 1 - 2\epsilon$ from Equation 13 imposes $w_{t,0} D_0(K_t) + \sum_{t' < t} w_{t,t'} \frac{D(K_t \cap K_{t'})}{D(K_{t'})} \le \epsilon$. (19) In the following, we propose $K_t$'s such that the constraints of Equations 16 and 19 are satisfied; these are the necessary and sufficient conditions for the $h_t$'s to be consistent with path dynamic benchmarks. For $0 \le t < T = \frac{2}{\epsilon}$: Consider $K_t$'s such that $K_t \cap K_{t'} = K$ for every t' < t. Sufficient conditions to satisfy Equations 16 and 19 are $K_t \cap K_{t'} = K$ (20), $\frac{D(K)}{D(K_t)} = \frac{k}{k_t} \le \frac{\epsilon}{2}$ (21), and $D_0(K_t) \le \frac{\epsilon}{2}$ (22) at every time step. Let us reorder the axes in ascending order of $D_0$ and name them 1 to d, and assume this also yields an ascending order of D; for example, D = unif(X_d) always satisfies this. We set K = [1, k] and K_0 = [1, k'] with $k' = \frac{2k}{\epsilon}$. Then for $1 \le t < T = \frac{2}{\epsilon}$, we set $K_t = K \cup \big(k' + (t-1)(k'-k),\, k' + t(k'-k)\big]$, so that all $K_t$'s share exactly the common block K of size k, and each $K_t$ with t ≥ 1 additionally covers a fresh block of size k' - k.
It is straightforward to see that this assignment satisfies Equations 20 and 21. Since the axes are ordered ascendingly according to $D_0$, a sufficient condition to satisfy Equation 22 for all rounds is $D_0(K_{T-1}) \le D_0\big(k' + (T-1)(k'-k)\big)\, k' \le D_0(Tk')\, k' = D_0\big(\tfrac{2k'}{\epsilon}\big)\, k' \le \frac{\epsilon}{2}$. (23) For $y = \frac{2k'}{\epsilon}$, the above condition is $D_0(y) \le \frac{1}{y}$. Again, since the axes are ordered in ascending order of $D_0$, we have $D_0(y) \le \frac{1}{d - y + 1}$; otherwise, the sum of the entries of $D_0$ would exceed 1. We define updated weights $v_{t,\tau} = w_{t,\tau} + \sum_{T \le t' < t} w_{t,t'}\, \mathbb{1}\{\phi(t') = \tau\}$. (27) Then $\phi(\cdot)$ assigns round t according to $\phi(t) = \arg\min_{0 \le \tau < T} v_{t,\tau}$. The sum of the updated weights satisfies $\sum_{0 \le \tau < T} v_{t,\tau} = \sum_{0 \le \tau < T} w_{t,\tau} + \sum_{T \le t' < t} w_{t,t'} \sum_{0 \le \tau < T} \mathbb{1}\{\phi(t') = \tau\} = \sum_{0 \le t' < t} w_{t,t'} \le 1$, where we used the identity $\sum_{0 \le \tau < T} \mathbb{1}\{\phi(t') = \tau\} = 1$. This guarantees $v_{t,\phi(t)} \le \frac{1}{T} = \frac{\epsilon}{2}$. (28) Plugging $h_t = h_{\phi(t)}$ into Equation 19, the left-hand side becomes $w_{t,0} D_0(K_{\phi(t)}) + \Big(w_{t,\phi(t)} + \sum_{T \le t' < t,\, \phi(t') = \phi(t)} w_{t,t'}\Big) + \sum_{0 \le \tau < T,\, \tau \ne \phi(t)} w_{t,\tau} \frac{D(K)}{D(K_\tau)} + \sum_{T \le t' < t,\, \phi(t') \ne \phi(t)} w_{t,t'} \frac{D(K)}{D(K_{\phi(t')})} \le \Big(w_{t,0} + \sum_{0 \le \tau < T,\, \tau \ne \phi(t)} w_{t,\tau} + \sum_{T \le t' < t,\, \phi(t') \ne \phi(t)} w_{t,t'}\Big)\frac{\epsilon}{2} + v_{t,\phi(t)} \le \epsilon$. Here we used the following observations. Since $h_\tau$ satisfies $D_0(K_\tau) \le \frac{\epsilon}{2}$ for all τ < T, $h_t$ also satisfies $D_0(K_t) = D_0(K_{\phi(t)}) \le \frac{\epsilon}{2}$. Moreover, all the ratios $\frac{D(K)}{D(K_\tau)}$ and $\frac{D(K)}{D(K_{\phi(t')})}$ are bounded by $\frac{\epsilon}{2}$ by Equation 21. We finally applied Equation 28 to bound $v_{t,\phi(t)}$. This completes the argument that $h_t$ is consistent for t ≥ T. So far, we have introduced a sequence of classifiers $(h_t)_t$ that are consistent with path dynamic benchmarks and all misclassify K. We also showed that for the choice $d = \frac{8}{\epsilon^2}$, we have $\frac{k}{d} = \frac{\epsilon^2}{8}$. The only required assumption is that the entries of $D_0$ and D are both ascending. For the special case of D = unif(X_d), $D(K) = \frac{k}{d}$.
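As a sanity check of the construction above (our own verification, not part of the proof), the sets $K_t$ can be built explicitly for concrete choices of ϵ and k, here ϵ = 0.25 and k = 2 with uniform D, so that masses are set sizes divided by d:

```python
# Verify the K_t construction in the proof of Theorem 3.5 numerically.
eps, k = 0.25, 2
kp = round(2 * k / eps)                       # k' = 2k / eps
T = round(2 / eps)
d = round(8 / eps ** 2)

K = set(range(1, k + 1))                      # common error core K = [1, k]
K_sets = [set(range(1, kp + 1))]              # K_0 = [1, k']
for t in range(1, T):
    lo = kp + (t - 1) * (kp - k)
    K_sets.append(K | set(range(lo + 1, lo + kp - k + 1)))

for t1 in range(T):
    for t2 in range(t1):
        assert K_sets[t1] & K_sets[t2] == K   # Equation 20: intersections collapse to K
for Kt in K_sets:
    assert k / len(Kt) <= eps / 2             # Equation 21 under uniform D
    assert max(Kt) <= d                       # the construction fits inside X_d
print("construction consistent for eps =", eps)
```

All pairwise intersections equal K, every ratio is within ϵ/2, and the blocks fit inside a domain of size $d = 8/\epsilon^2$, matching the proof's requirements.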
Since the elements of K are misclassified by all of the classifiers, no matter how their majority vote is calculated, K will be in the error set: $R_D(\mathrm{maj}(h_0, h_1, \dots)) \ge D(K) = \frac{k}{d} = \frac{\epsilon^2}{8}$. This completes the theorem. Proof of Theorem 3.6. Let $D^\delta_t = D_t|_{x \in X_\delta}$ and $\bar D^\delta_t = D_t|_{x \notin X_\delta}$. Also let $\delta_t = \Pr_{D_t}(x \in X_\delta)$. So we can write $D_t$ as a mixture of its restricted distributions: $D_t = \delta_t D^\delta_t + (1 - \delta_t)\bar D^\delta_t$. The classifier $h_t$ trained on $D_t$ must satisfy $R_{D_t}(h_t) = \delta_t R_{D^\delta_t}(h_t) + (1 - \delta_t) R_{\bar D^\delta_t}(h_t) = \frac{\delta_t}{2} + (1 - \delta_t) R_{\bar D^\delta_t}(h_t) \le \min_{h \in H} R_{D_t}(h) + \epsilon = \frac{\delta_t}{2} + \epsilon$. These equations let us bound $R_D(g_1)$ and $\Pr_D(E_{g_1} \mid E_{g_0})$: $R_D(g_1) \le \Pr_D(E_{h_3} \cap E_{h_4}) + \Pr_D(E_{h_3} \cap E_{h_5}) + \Pr_D(E_{h_4} \cap E_{h_5}) \le 3\epsilon(2\epsilon + \Delta_0) + 4\epsilon(2\epsilon + \Delta_0) + 4\epsilon(3\epsilon + \Delta_0) = 26\epsilon^2 + 11\epsilon\Delta_0$, and $\Pr_D(E_{g_1} \mid E_{g_0}) \le \Pr_D(E_{h_3} \mid E_{g_0}) + \Pr_D(E_{h_4} \mid E_{g_0}) + \Pr_D(E_{h_5} \mid E_{g_0}) \le 2\epsilon + 3\epsilon + 4\epsilon = 9\epsilon$. Finally, for the third path dynamic benchmarking step (h_6 → h_7 → h_8) we have: $\frac{1}{3} R_{D_0}(h_6) + \frac{1}{3}\Pr_D(E_{h_6} \mid E_{g_0}) + \frac{1}{3}\Pr_D(E_{h_6} \mid E_{g_1}) \le \epsilon$ (44); $\frac{1}{4} R_{D_0}(h_7) + \frac{1}{4}\Pr_D(E_{h_7} \mid E_{g_0}) + \frac{1}{4}\Pr_D(E_{h_7} \mid E_{g_1}) + \frac{1}{4}\Pr_D(E_{h_7} \mid E_{h_6}) \le \epsilon$ (45); $\frac{1}{5} R_{D_0}(h_8) + \frac{1}{5}\Pr_D(E_{h_8} \mid E_{g_0}) + \frac{1}{5}\Pr_D(E_{h_8} \mid E_{g_1}) + \frac{1}{5}\Pr_D(E_{h_8} \mid E_{h_6}) + \frac{1}{5}\Pr_D(E_{h_8} \mid E_{h_7}) \le \epsilon$ (46); and a similar calculation shows $\Pr_D(E_{g_2} \mid E_{g_1}) \le 12\epsilon$. Putting this all together, we can bound the risk of maj(g_0, g_1, g_2) by $R_D(\mathrm{maj}(g_0, g_1, g_2)) \le \Pr_D(E_{g_0} \cap E_{g_1}) + \Pr_D(E_{g_0} \cap E_{g_2}) + \Pr_D(E_{g_1} \cap E_{g_2}) \le R_D(g_0)\Pr_D(E_{g_1} \mid E_{g_0}) + R_D(g_0)\Pr_D(E_{g_2} \mid E_{g_0}) + R_D(g_1)\Pr_D(E_{g_2} \mid E_{g_1}) \le 9\epsilon(11\epsilon^2 + 8\Delta_0\epsilon) + 12\epsilon(11\epsilon^2 + 8\Delta_0\epsilon) + 12\epsilon(26\epsilon^2 + 11\Delta_0\epsilon) = 543\epsilon^3 + 300\epsilon^2\Delta_0$, which completes the proof. Proof of Theorem 4.2. We follow the proof of Theorem 3.5 and define equivalent vector representations of functions and distributions in a new finite domain $X_d \in X^d$ such that $X_d$ can be shattered by H (so $d \le \frac{2}{\epsilon^3}$).
We also name variables according to Figure 2. For a true classifier f, $K_t$ denotes the set of indices where $h_t$ and f disagree; similarly, $K_{g_t}$ is the set of indices where $g_t$ and f disagree. First, we reorder the axes in ascending order of $D_0$, and assume this also yields an ascending order of D; for example, D = unif(X_d) always satisfies this. Naming the axes 1 to d, we assign the sets $K_t$ and $K_{g_t}$ as blocks of consecutive indices: each group of three rounds shares a common core (of size $\frac{1}{\epsilon}$ in the first group, and the point 1 plus a block of size $\frac{1}{\epsilon} - 1$ in the later groups), and each round additionally covers a fresh block of roughly $\frac{1}{\epsilon^2}$ indices; the exact intervals are specified in the three steps below. Next, we discuss each path dynamic benchmarking step and find sufficient conditions under which the $K_t$'s are consistent with the dynamic benchmarking routine.

1. For the first path dynamic benchmarking step (h_0 → h_1 → h_2), let $K_0 = [1, \frac{1}{\epsilon^2}]$, $K_1 = [1, \frac{1}{\epsilon}] \cup (\frac{1}{\epsilon^2}, \frac{2}{\epsilon^2} - \frac{1}{\epsilon}]$, and $K_2 = [1, \frac{1}{\epsilon}] \cup (\frac{2}{\epsilon^2} - \frac{1}{\epsilon}, \frac{3}{\epsilon^2} - \frac{2}{\epsilon}]$. We have $K_{g_0} = K_0 \cap K_1 = K_0 \cap K_2 = K_1 \cap K_2 = [1, \frac{1}{\epsilon}]$ (49) and $\frac{D(K_{g_0})}{D(K_1)} \le \frac{D(K_{g_0})}{D(K_0)} \le \epsilon$. (50) With reasoning similar to the proof of Theorem 3.5, the only remaining condition ensuring $h_0$, $h_1$, and $h_2$ are consistent with path dynamic benchmarks, regardless of the weightings, is $D_0(K_2) \le \epsilon$. The output of this step has error set $K_{g_0} = [1, \frac{1}{\epsilon}]$ no matter how the weighted majority vote is calculated. 2. For the second path dynamic benchmarking step (h_3 → h_4 → h_5), let $K_3 = [1] \cup (\frac{1}{\epsilon}, \frac{1}{\epsilon^2} + \frac{1}{\epsilon} - 1]$, $K_4 = [1] \cup (\frac{1}{\epsilon}, \frac{2}{\epsilon} - 1] \cup (\frac{1}{\epsilon^2} + \frac{1}{\epsilon} - 1, \frac{2}{\epsilon^2} - 1]$, and $K_5 = [1] \cup (\frac{1}{\epsilon}, \frac{2}{\epsilon} - 1] \cup (\frac{2}{\epsilon^2} - 1, \frac{3}{\epsilon^2} - \frac{1}{\epsilon} - 1]$. We have $K_{g_1} = K_3 \cap K_4 = K_3 \cap K_5 = K_4 \cap K_5 = [1] \cup (\frac{1}{\epsilon}, \frac{2}{\epsilon} - 1]$ (51), $\frac{D(K_{g_1})}{D(K_4)} \le \frac{D(K_{g_1})}{D(K_3)} \le \epsilon$ (52), and $\frac{D(K_3 \cap K_{g_0})}{D(K_{g_0})} = \frac{D(K_4 \cap K_{g_0})}{D(K_{g_0})} = \frac{D(K_5 \cap K_{g_0})}{D(K_{g_0})} \le \epsilon$. (53) It is easy to check that for any weighting of mixture components, if $D_0(K_5) \le \epsilon$, then $h_3$, $h_4$, and $h_5$ are consistent with path dynamic benchmarks and the output of this step has error set $K_{g_1} = [1] \cup (\frac{1}{\epsilon}, \frac{2}{\epsilon} - 1]$. Note that $D_0(K_5) \le \epsilon$ guarantees $D_0(K_2) \le \epsilon$ as well. 3. For the third path dynamic benchmarking step (h_6 → h_7 → h_8), let $K_6 = [1] \cup (\frac{2}{\epsilon} - 1, \frac{1}{\epsilon^2} + \frac{2}{\epsilon} - 2]$, $K_7 = [1] \cup (\frac{2}{\epsilon} - 1, \frac{3}{\epsilon} - 2] \cup (\frac{1}{\epsilon^2} + \frac{2}{\epsilon} - 2, \frac{2}{\epsilon^2} + \frac{1}{\epsilon} - 2]$, and $K_8 = [1] \cup (\frac{2}{\epsilon} - 1, \frac{3}{\epsilon} - 2] \cup (\frac{2}{\epsilon^2} + \frac{1}{\epsilon} - 2, \frac{3}{\epsilon^2} - 2]$. We have $K_{g_2} = K_6 \cap K_7 = K_6 \cap K_8 = K_7 \cap K_8 = [1] \cup (\frac{2}{\epsilon} - 1, \frac{3}{\epsilon} - 2]$ (54), $\frac{D(K_{g_2})}{D(K_7)} \le \frac{D(K_{g_2})}{D(K_6)} \le \epsilon$ (55), $\frac{D(K_6 \cap K_{g_0})}{D(K_{g_0})} = \frac{D(K_7 \cap K_{g_0})}{D(K_{g_0})} = \frac{D(K_8 \cap K_{g_0})}{D(K_{g_0})} \le \epsilon$ (56), and $\frac{D(K_6 \cap K_{g_1})}{D(K_{g_1})} = \frac{D(K_7 \cap K_{g_1})}{D(K_{g_1})} = \frac{D(K_8 \cap K_{g_1})}{D(K_{g_1})} \le \epsilon$. (57) Again, it is straightforward to see that for any weighting, if $D_0(K_8) \le \epsilon$, then $h_6$, $h_7$, and $h_8$ are consistent with path dynamic benchmarks and the output of this step has error set $K_{g_2} = [1] \cup (\frac{2}{\epsilon} - 1, \frac{3}{\epsilon} - 2]$.
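The interval assignments above can be checked mechanically. This is our own sanity check (not part of the proof), instantiated for a concrete ϵ with $\frac{1}{\epsilon} \in \mathbb{N}$ and uniform D, so that masses are set sizes divided by d:

```python
# Verify the interval assignments in the proof of Theorem 4.2 for eps = 0.1.
eps = 0.1
a, b = round(1 / eps), round(1 / eps ** 2)    # 1/eps and 1/eps^2

def rng(lo, hi):                              # discrete interval (lo, hi]
    return set(range(lo + 1, hi + 1))

K0, K1 = rng(0, b), rng(0, a) | rng(b, 2 * b - a)
K2 = rng(0, a) | rng(2 * b - a, 3 * b - 2 * a)
K3 = {1} | rng(a, b + a - 1)
K4 = {1} | rng(a, 2 * a - 1) | rng(b + a - 1, 2 * b - 1)
K5 = {1} | rng(a, 2 * a - 1) | rng(2 * b - 1, 3 * b - a - 1)
K6 = {1} | rng(2 * a - 1, b + 2 * a - 2)
K7 = {1} | rng(2 * a - 1, 3 * a - 2) | rng(b + 2 * a - 2, 2 * b + a - 2)
K8 = {1} | rng(2 * a - 1, 3 * a - 2) | rng(2 * b + a - 2, 3 * b - 2)

Kg0, Kg1, Kg2 = rng(0, a), {1} | rng(a, 2 * a - 1), {1} | rng(2 * a - 1, 3 * a - 2)
assert K0 & K1 == K0 & K2 == K1 & K2 == Kg0   # Equation 49
assert K3 & K4 == K3 & K5 == K4 & K5 == Kg1   # Equation 51
assert K6 & K7 == K6 & K8 == K7 & K8 == Kg2   # Equation 54
assert Kg0 & Kg1 & Kg2 == {1}                 # the point every g_t misclassifies
print(max(map(max, [K2, K5, K8])), round(2 / eps ** 3))   # -> 298 2000, fits inside X_d
```

The check confirms that each group's pairwise intersections collapse to the group's core, that the three cores intersect exactly at the point {1}, and that the construction fits inside a domain of size $d = \frac{2}{\epsilon^3}$.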
Note that $D_0(K_8) \le \epsilon$ guarantees $D_0(K_5) \le \epsilon$ as well. Putting this all together, to make the provided classifiers consistent with hierarchical dynamic benchmarking, it suffices to show $D_0(K_8) \le \epsilon$: $D_0(K_8) \le D_0\big(\frac{3}{\epsilon^2} - 2\big)\big(\frac{1}{\epsilon^2} - \frac{1}{\epsilon}\big) \le \epsilon$. Since the axes are ordered in ascending order of $D_0$, we have $D_0(x) \le \frac{1}{d - x + 1}$; otherwise, the sum of the entries of $D_0$ would exceed 1. So we have $\frac{1}{d - \frac{3}{\epsilon^2} + 1}\big(\frac{1}{\epsilon^2} - \frac{1}{\epsilon}\big) \le \epsilon$, (59) which imposes $d \ge \frac{1}{\epsilon^3} + \frac{2}{\epsilon^2} - 1$. For any valid ϵ, i.e., $\epsilon \le \frac{1}{2}$, $d = \frac{2}{\epsilon^3}$ satisfies this condition. It also satisfies the other limitation we had on VCdim(H). In this case $R_D(\mathrm{maj}(g_0, g_1, g_2)) \ge D(K_{g_0} \cap K_{g_1} \cap K_{g_2}) = D([1])$. (60) For the choice D = unif(X_d), this risk is $\frac{1}{d} = \frac{\epsilon^3}{2}$, which completes the proof. Proof of Lemma A.1. The derivative of $R^{\mathrm{hinge}}_D(h)$ with respect to h(x) is $g(x) := \frac{\partial R^{\mathrm{hinge}}_D}{\partial h(x)} = 0$ if $h(x)f(x) \ge 1$, and $-f(x)D(x)$ otherwise. In vector representation, $g = -\Pr_D(E_h)\,(f \circ D_h)$. As mentioned earlier, updates of the form h := h - ηg are infeasible, since in practice $D_h$ is available only through a limited set of samples. To make the update practical, let $\hat h = A(D_h)$. As A is an ϵ-approximate risk minimizer, $R_{D_h}(\hat h) = \frac{1}{2}\langle \mathbf{1} - \hat h \circ f, D_h \rangle = \frac{1}{2} - \frac{1}{2}\langle \hat h, f \circ D_h \rangle \le \epsilon$. So we have $\langle \hat h, f \circ D_h \rangle \ge 1 - 2\epsilon$, and for ϵ < 0.5, $\hat h$ is a descent direction.



http://nlp.stanford.edu/data/glove.6B.zip



Figure 1: Example of a path dynamic benchmark. Symbol A represents risk minimization, H represents human data collection, and ⊕ shows distribution mixing.

Figure 2: Depth-2 width-3 hierarchical dynamic benchmark. Symbols as in Figure 1 with M representing majority vote.

Figure 3: Simulation of a path dynamic benchmark. The base model is kept fixed for all rollouts. The y-axis shows the risk of the majority vote of all classifiers obtained up to that round.

Figure 4: Dynamic gradient-based update: hinge loss minimization.

Figure 5: Similarity to the theoretical negative example, measured by z_T, can partially explain the risk of the majority vote classifier over all rounds. The blue dashed line is the least-squares fit.

Then simple calculations show that $d \ge 2y = \frac{8k}{\epsilon^2}$ suffices for Equation 23 to hold. Since k is an integer, this is possible for $d \ge \frac{8}{\epsilon^2}$, which requires $\mathrm{VCdim}(H) \ge \frac{8}{\epsilon^2}$. For $d = \frac{8}{\epsilon^2}$, this gives $\frac{k}{d} = \frac{\epsilon^2}{8}$. For t ≥ T: We set $h_t = h_{\phi(t)}$, where $\phi : [T, \infty) \to [0, T-1]$ is an assignment function that works as follows. We define updated weights $v_{t,\tau}$ for $0 \le \tau < T$, starting from $v_{t,0} = w_{t,0}$.

This lets us bound Pr D (E g2 |E g0 ) and Pr D (E g2 |E g1 ):


which requires $(1 - \delta_t) R_{\bar D^\delta_t}(h_t) \le \epsilon$. Here we used the fact that the labels of $X_\delta$ are equally likely to be 1 or -1, so the risk of any classifier under $D^\delta_t$ is $\frac{1}{2}$. The inequality comes from A being an ϵ-approximate risk minimizer, and the last equality is due to the realizability of $\bar D^\delta_t$. The error distribution $\tilde D_t$ of $h_t$ follows accordingly, with $\delta = \Pr_D(x \in X_\delta)$; note that we again used the fact that the labels of $X_\delta$ are random. Consequently, the weight of $X_\delta$ in Equation 32 is at least $\big(1 + 2(t+1)\frac{\epsilon}{\delta}\frac{1-\delta}{1-\delta_t}\big)^{-1}$. The total weight given to $X_\delta$ by the distribution $D_{t+1} = \frac{t+1}{t+2} D_t + \frac{1}{t+2} \tilde D_t$ then yields a recursion for $\delta_{t+1}$. The first observation from this recursion, using $\delta_0 = \delta$, is that for sufficiently large $\frac{\delta}{\epsilon}$, $\delta_1$ goes above $\frac{1}{2}$, meaning more than half of the benchmark will be focused on the unrealizable instances. Next, we show this is not limited to the first round: $\delta_t$ maintains a large proportion of $D_t$ as we progress. Recursively expanding Equation 34, with $a = \frac{4\epsilon}{\delta}(1-\delta)$, and applying $\ln(1+x) \ge \frac{x}{1+x}$ to obtain the last inequality, shows that the upper bound cannot decrease faster than $\Omega\big(\frac{1}{1+at}\big)$. This completes the proof. Proof of Theorem 4.1. We use the same notation as Figure 2 and follow arguments similar to the proof of Theorem 3.4. For notational convenience, we write $\Delta_0$ for $d_{H\Delta H}(D_0, D)$. Starting from the first path dynamic benchmarking step (h_0 → h_1 → h_2), Theorem 3.4 bounds the pairwise joint errors of $h_0$, $h_1$, $h_2$, from which we can bound $R_D(g_0)$. For the second path dynamic benchmarking step (h_3 → h_4 → h_5), the analogous constraints hold, or equivalently their inner-product form in the vector space. To find a good descent direction, we solve for the minimizer of $\langle \hat h, g \rangle$ over the hypothesis class. The solution to this problem is the minimizer of $R_{D_h}(\hat h)$, where $D_h$ is the true distribution reweighted according to $D_h(x) = \frac{1}{Z_h} D(x)\exp(-h(x)f(x))$ and $Z_h$ is a normalization factor ensuring $D_h$ is a probability distribution. Let $\hat h = A(D_h)$.
Given that A is an ϵ-approximate risk minimizer, for ϵ < 0.5, $\hat h$ is a descent direction and the update rule is $h := h + \eta \hat h$. Finally, to find the optimal step size η, we plug the updated h into Equation 8 and minimize over η. Since the objective is a convex function of η, it suffices to check the first-order condition; solving the resulting equation gives $\eta = \frac{1}{2}\log\big(\frac{1}{R_{D_h}(\hat h)} - 1\big)$. Theorem C.2. For any ϵ-approximate risk minimizer A with $\frac{1}{\epsilon} \in \mathbb{N}$, and any path dynamic benchmark with L ≥ 3 rounds of model building and arbitrary mixture weights, there exists a hypothesis class H with VCdim(H) ≤ 4 such that for any true classifier f ∈ H, there exists an underlying distribution D where, for any initial distribution $D_0$ with supp(D_0) ⊆ supp(D) that satisfies $D(x_2) \ge D(x_1) \iff D_0(x_2) \ge D_0(x_1)$, there exists a sequence $(h_t)_{t=0}^{T-1}$ of classifiers consistent with the path dynamic benchmark for which a lower bound similar to Theorem 3.5 holds. Proof. The proof is straightforward given the results obtained in the proof of Theorem 3.5. The way we constructed the $K_t$'s is equivalent to selecting two subsets of $\mathbb{R}^+$ and flipping their predicted labels. So let H' be the class of such functions, i.e., the class of unions of two intervals (VCdim(H') = 4). For any f ∈ H', we select d points from the real line on which f is constant (all 1 or all -1). Let $X_d \in \mathbb{R}^d$ be the set of selected points. We then induce a probability distribution D on $X_d$ with $D(x) = \frac{1}{d} + \alpha(x)$, where α(x) is an ascending function of x and $\sum_{x \in X_d} \alpha(x) = 0$. All of our arguments in the first part of the proof then hold for any $D_0$ that is ascending or descending w.r.t. x; taking the limit completes the proof. Theorem C.3.
For any ϵ-approximate risk minimizer A with $\frac{1}{\epsilon} \in \mathbb{N}$, and any hierarchical dynamic benchmark of depth 2 and width 3 (Figure 2) with arbitrary mixture and majority weights, there exists a hypothesis class H with VCdim(H) ≤ 6 such that for any true classifier f ∈ H, there exists an underlying distribution D where, for any initial distribution $D_0$ with supp(D_0) ⊆ supp(D) that satisfies $D(x_2) \ge D(x_1) \iff D_0(x_2) \ge D_0(x_1)$, there exist classifiers consistent with the hierarchical dynamic benchmark for which a lower bound similar to Theorem 4.2 holds. Proof. Given the results obtained in the proof of Theorem 4.2, the proof of this theorem is straightforward. The way we chose the $K_t$'s in the proof of Theorem 4.2 is equivalent to choosing three intervals from the real line. So let H' be the class of such functions (VCdim(H') = 6). For any f ∈ H', we select d points from the real line on which f is constant (all 1 or all -1). Let $X_d \in \mathbb{R}^d$ be the set of selected points. We then induce a probability distribution D on $X_d$ with $D(x) = \frac{1}{d} + \alpha(x)$, where α(x) is an ascending function of x and $\sum_{x \in X_d} \alpha(x) = 0$. All of our arguments in the first part of the proof hold for any $D_0$ that is ascending or descending w.r.t. x; taking the limit completes the proof.

