EFFICIENTLY CONTROLLING MULTIPLE RISKS WITH PARETO TESTING

Abstract

Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grows, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process that combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing to this frontier only, in order to identify configurations that have (i) high utility with respect to our objectives, and (ii) guaranteed risk levels with respect to our constraints, with specifiably high probability. We demonstrate the effectiveness of our approach in reliably accelerating the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes - including the number of layers computed before exiting, the number of attention heads pruned, and the number of text tokens considered - to simultaneously control and optimize various accuracy and cost metrics.

1. INTRODUCTION

Suppose you want to deploy a modern machine learning model in a real-world environment. As a practitioner, you may frequently have to weigh several performance considerations (Jin & Sendhoff, 2008; Ribeiro et al., 2020; Min et al., 2021). For example, how much computational budget can you spend? What accuracy do you require? How large, if any, of a discrepancy in predictive performance across different groups of end-users can you tolerate? Often models are equipped with hyper-parameter configurations that provide "knobs" for tuning different aspects of their performance, depending on how such questions are answered. As the number of parameter dimensions and objectives grows, however, choosing the right set of parameters to rigorously control model performance on test data in the intended ways can become prone to error. To address this challenge, the recently proposed Learn Then Test (LTT) framework of Angelopoulos et al. (2021) combines any type of parameterizable predictive model with classic statistical hypothesis testing to provide an algorithm for selecting configurations that lead to provable distribution-free, finite-sample risk control of any user-specified objective. Nevertheless, while theoretically general, a pair of key practical challenges arises when the space of parameters to explore and the set of constraints to satisfy are large. The first is that evaluating all possible configurations can quickly become intractable; the second is that the statistical tests relied upon to guarantee risk control can quickly lose power - and fail to identify configurations that are also useful for the task at hand. In this work, we build upon the results of LTT by introducing Pareto Testing, a simple procedure that provides a computationally and statistically efficient way to identify valid, risk-controlling configurations with (specifiably) high probability, which, critically, are also useful with respect to other objectives of interest.
Our method consists of two stages. In the first stage, we solve an unconstrained, multi-objective optimization problem in order to recover an approximate set of Pareto-optimal configurations, i.e., settings for which no other configuration exists that is uniformly better in all respects. Here we can exploit standard multi-objective optimization methods to efficiently explore large parameter spaces and filter them down to only their most promising configurations. In the second stage, we perform rigorous sequential testing over the recovered set, which we empirically find to yield tight control of our desired risks, while also giving good performance with respect to our free objectives.

Figure 1: A demonstration of our calibration procedure applied to multi-dimensional adaptive computation in a Transformer model (left). Here we have the option to drop tokens from the input, make an "early-exit" prediction after processing a subset of the layers, or only compute a subset of the self-attention heads in each layer in order to do faster inference. Our calibration procedure (right) applies multi-objective optimization to identify a Pareto frontier of configurations with different performance profiles, and then applies statistical testing to efficiently identify a subset of "risk-controlling" configurations with high probability (e.g., bounded error rates).

We apply our approach to adaptive computation in large-scale Transformer models (Vaswani et al., 2017) for natural language processing (NLP); see Figure 1. While larger models generally perform better, they can also be incredibly computationally intensive to run (Bapna et al., 2020; Schwartz et al., 2020; Moosavi et al., 2021). Often, however, not every application, domain, or example requires the same amount of computation to achieve similar performance.
As such, many techniques have been proposed for accelerating computation, including attention head pruning, token dropping, or early exiting (Graves, 2016; Xin et al., 2020; Hou et al., 2020; Goyal et al., 2020). Still, determining the extent to which to apply different modifications while still preserving good performance can be tricky. Our proposed procedure allows the user to jointly configure multiple model settings subject to multiple statistical guarantees on model performance - such as average and worst-case relative reductions in accuracy (e.g., so that the adaptive model is within 5% of the full model's accuracy), average inference cost (e.g., so that the adaptive model uses less than a certain number of FLOPs on average), or maximum abstention rates in selective prediction settings.

Contributions. The core idea and contribution of our work can be summarized quite plainly:
1. Our framework leverages statistical testing techniques via the LTT framework (Angelopoulos et al., 2021) to identify valid risk-controlling hyper-parameter configurations;
2. To improve efficiency, we introduce Pareto Testing, our main contribution, as a way to efficiently guide the number and order of configurations that we test when searching for valid settings;
3. We demonstrate the scalability and effectiveness of our method in managing trade-offs in multi-dimensional adaptive computation in NLP applications with large-scale Transformer models;
4. On diverse text classification tasks, we empirically achieve tight, simultaneous control of multiple risks while also improving performance on any non-controlled objectives, relative to baselines.

2. RELATED WORK

Risk control. Our work adds to a rich history of tools for uncertainty estimation and risk control for machine learning algorithms (Vovk, 2002; Vovk et al., 2015; 2017; Lei et al., 2013; 2018; Gupta et al., 2020; Bates et al., 2021; Barber et al., 2021; Angelopoulos et al., 2021). Here we focus on achieving model-agnostic, distribution-free, and finite-sample performance guarantees - similar to the coverage guarantees given by prediction sets or regression intervals in conformal prediction (Papadopoulos et al., 2002; Vovk et al., 2005; Angelopoulos et al., 2022). As outlined in §1, this paper builds on the groundwork set by Angelopoulos et al. (2021), which provides a general methodology for calibrating any risk function that is controllable via some low-dimensional hyper-parameter configuration. We extend their framework to efficiently handle (relatively) higher-dimensional settings with multiple auxiliary objectives. Our application to confident model acceleration is also closely related to Schuster et al. (2021; 2022), though our method is designed for a much broader setting that involves (i) multiple objectives, and (ii) multiple model pruning dimensions.

Multi-objective optimization. Solving for multiple objectives is a fundamental problem (Deb, 2001; Miettinen, 2012; Bradford et al., 2018). Typically, multi-objective problems are more difficult than single-objective problems, as a single solution does not always exist due to trade-offs. Instead, there is a set of solutions that are all equally "good", which is known as the Pareto frontier (Censor, 1977; Arora, 2004). Our setting falls at the intersection of multi-objective optimization and risk control, where we want to perform multi-objective optimization subject to statistical bounds on a subset of the objectives.
Our two-stage approach is able to directly combine techniques in multi-objective optimization (Knowles, 2006; Lindauer et al., 2022) with those in risk control (Angelopoulos et al., 2021), in order to identify valid, statistically efficient solutions.

Model configuration. We approach our multi-objective optimization problem by uncovering model configurations that deliver on the desired performance guarantees (e.g., bounded error rates), while also providing "best-effort" optimization of the auxiliary objectives (e.g., minimal inference cost) without any re-training. This is adjacent to the field of hyper-parameter tuning and architecture search, which deals with determining appropriate model hyper-parameter values, or even designing higher-level network structures (Elsken et al., 2019). While most approaches focus on finding configurations that maximize predictive performance, some have also considered additional measures such as efficiency (Shah & Ghahramani, 2016; Belakaria et al., 2019; Elsken et al., 2018; Dong et al., 2018; Zhou et al., 2018; Chu et al., 2020), fairness (Schmucker et al., 2020; Candelieri et al., 2022), or robustness (Karl et al., 2022). Our work, however, differs by treating hyper-parameter selection as a multiple-testing problem with rigorous statistical guarantees following Angelopoulos et al. (2021).

Adaptive computation. Our main application is configuring adaptive model computation. Large-scale deep learning models can be accurate, but also computationally intensive to run. Many efforts have been focused on improving run-time efficiency, including model distillation (Sanh et al., 2019; Jiao et al., 2020; Sun et al., 2020), dynamic architecture selection (Yu et al., 2019; Cai et al., 2020; Hou et al., 2020), early-exiting (Teerapittayanon et al., 2016; Liu et al., 2020), token pruning (Goyal et al., 2020; Ye et al., 2021; Kim et al., 2021; Modarressi et al., 2022; Guan et al., 2022), and others.
This work focuses on configurable, adaptive computation that does not require model re-training. Furthermore, only a few methods have proposed combining multiple pruning dimensions, such as depth and width (Hou et al., 2020) , token pruning with early-exiting (He et al., 2021) , and pruning model units of different granularity (Xia et al., 2022) . Our multi-dimensional calibration scheme generalizes these approaches, and allows for flexible tuning in each pruning axis.

3. PROBLEM FORMULATION

Consider an input variable X ∈ X, and an associated label Y ∈ Y, drawn from some joint distribution. Assume a predictive model of the form f : X × Λ → Y, where Λ ≜ Λ1 × ... × Λn is the space of n hyper-parameters (λ1, ..., λn) that configure f. The parameters of the model f are optimized over a training set Dtrain, while the hyper-parameters then provide n additional degrees of freedom, influencing either (i) how the model is trained over Dtrain, or (ii) how the model is used. We focus on the latter in this paper. For example, in our adaptive Transformer example, f has n = 3 pruning dimensions: the number of attention heads per layer, the truncated length of the input text sequence, and the effective network depth in terms of the selected early-exit layer. In this scenario, the hyper-parameters (λ1, λ2, λ3) are real-valued thresholds that determine the extent of sparsification along each axis, given some "importance/confidence" score (to be defined later). Next, consider a set of objective functions {Q1, ..., Qc+k} of the form Qi(λ1, ..., λn) = E[qi(X, Y; λ1, ..., λn)] for some loss function qi. Under our setting, we assume that the user wishes to bound the first c objective functions (hereby called risk functions) by arbitrary levels {α1, ..., αc} ∈ R^c, while also minimizing the remaining k objective functions. We further assume that (λ1, ..., λn) can be used to either increase or decrease each objective Qi (not necessarily independently), although we do not assume that all values of Qi are jointly achievable. To estimate suitable hyper-parameter values, let Dcal = {(Xi, Yi)}, i = 1, ..., m, be an i.i.d. calibration set that we will use to select (λ̂1, ..., λ̂n). As functions of Dcal, (λ̂1, ..., λ̂n) are also random variables, and therefore Qi(λ̂1, ..., λ̂n) is a random conditional expectation (constant given Dcal). Our goal is to select (λ̂1, ..., λ̂n) in such a way that the objectives {Q1, ..., Qc} that we wish to control are appropriately bounded with specifiably high probability - as we now formally define.

Definition 3.1 ((α, δ)-risk controlling configuration). Let Dcal = {(Xi, Yi)}, i = 1, ..., m, be i.i.d. random variables that are used to estimate a model configuration (λ̂1, ..., λ̂n). Let {Qi(λ̂1, ..., λ̂n)}, i = 1, ..., c, be a set of risk functions conditioned on the choice of (λ̂1, ..., λ̂n). For any set of risk levels {αi}, i = 1, ..., c, and tolerance δ ∈ (0, 1), we say that (λ̂1, ..., λ̂n) is an (α, δ)-risk controlling configuration if:

P( Qi(λ̂1, ..., λ̂n) ≤ αi, ∀i ∈ {1, ..., c} simultaneously ) ≥ 1 − δ,   (1)

where the probability in Eq. (1) is over the draw of Dcal. Many satisfactory configurations may exist for a given task and constraints. Our key practical goal is to find an (α, δ)-risk controlling configuration that also best minimizes the remaining objectives, {Qc+1, ..., Qc+k}. As such, we focus on the expected performance of Qc+j(λ̂1, ..., λ̂n), for j ∈ {1, ..., k}, as a relative measure of effectiveness when comparing selected configurations.

4. BACKGROUND

We briefly review the Learn Then Test framework of Angelopoulos et al. (2021), multiple hypothesis testing, and multi-objective optimization - which comprise the key components of our method.

Learn Then Test (LTT). The core idea of LTT is to use multiple hypothesis testing to rigorously select a risk-controlling configuration λ = (λ1, ..., λn) ∈ Λ. Consider a single risk Q. A set of possible configurations Λg is chosen for testing, usually by defining a discrete grid over the configuration space Λ. For each configuration λ ∈ Λg, LTT tests the null hypothesis Hλ : Q(λ) > α, i.e., that the risk is not controlled. A successful rejection of Hλ then implies that λ is risk-controlling. A valid (super-uniform) p-value to use as a basis for accepting or rejecting Hλ can be derived from concentration bounds on the empirical risk. For example, Hoeffding's inequality can be used to yield

p^cal(λ, α) = p(Q̂^cal(λ); α, m) = e^{−2m((α − Q̂^cal(λ))_+)^2},

so that P(p^cal(λ, α) ≤ u) ≤ u for all u ∈ [0, 1]. Similar p-values can be derived from different bounds; here, we will use the more powerful Hoeffding-Bentkus p-value throughout (Appendix A). A subset of valid λ configurations is then selected out of Λg by applying a family-wise error rate (FWER) controlling procedure, so that the probability of making one or more false discoveries (i.e., choosing an invalid λ) is bounded by δ ∈ (0, 1). This can be extended to multiple risk control by defining the combined null hypothesis Hλ : ∃i where Qi(λ) > αi. A valid combined p-value can be obtained by taking the maximum p-value over all objective functions (see Appendix A for a proof):

p^cal(λ, α) = max_{1≤i≤c} p(Q̂i^cal(λ); αi, m), with α = (α1, ..., αc).

Fixed sequence testing (FST). FST considers a sequence of hypothesis tests in some order, and terminates when the first null hypothesis Hλ fails to be rejected. The efficiency of FST relies heavily on the ordering of hypothesis tests (e.g., those that are more likely to be rejected should be tested earlier). Defining a proper ordering for FST is challenging for a large, possibly unstructured (or with unknown structure) Λ. This challenge intensifies when combined with the additional goal of finding configurations that not only provide multiple risk control, but also optimize Qc+1, ..., Qc+k.

Multi-objective optimization. Generally speaking, multi-objective optimization minimizes a vector-valued function g(λ) : Λ → R^r, where g(λ) = [G1(λ), ..., Gr(λ)]^T consists of r objectives Gi (we use G to avoid confusion with Q for now). For a nontrivial multi-objective function g(λ) with conflicting objectives (e.g., accuracy vs. cost), there is no single point that minimizes all objectives simultaneously. Instead, for any λ, λ′ ∈ Λ, we say that λ′ dominates λ (λ′ ≺ λ) if for every i ∈ {1, ..., r}, Gi(λ′) ≤ Gi(λ), and for some i ∈ {1, ..., r}, Gi(λ′) < Gi(λ). In other words, λ′ dominates λ if there is no objective for which λ is superior to λ′, and for at least one objective λ′ is strictly better. The Pareto optimal set consists of all points that are not dominated by any point in Λ:

Λpar = {λ ∈ Λ : {λ′ ∈ Λ : λ′ ≺ λ, λ′ ≠ λ} = ∅}.

5. PARETO TESTING

We now present our method for selecting effective risk-controlling configurations. We adopt the strategy of Split FST (Angelopoulos et al., 2021) and separate the calibration data into two disjoint subsets, Dopt and Dtesting, of sizes m1 and m2, respectively. The first split is used for defining an ordered sequence of configurations to test, while the second is used to conduct the hypothesis tests.

5.1. CONSTRUCTING THE PARETO FRONTIER

We begin by defining a set of tests/configurations to consider. We solve the multi-objective optimization problem defined by the vector-valued function q(λ) : Λ → R^{c+k} that consists of all of the objective functions (both constrained and unconstrained), i.e., q(λ) = [Q1(λ), ..., Qc+k(λ)]^T. Practically, we use q̂^opt(λ) with empirical objectives Q̂i^opt(λ) defined over Dopt. Any efficient solver for multi-objective optimization can be used to approximate the Pareto optimal set Λpar. We will show results with both brute-force optimization over a grid of configurations, as well as with a multi-objective Bayesian optimizer (Lindauer et al., 2022). The main idea of our method is to perform testing only over the Pareto optimal set. This set consists of the most "promising" configurations, and provides the best achievable trade-offs between all objectives (with respect to Dopt). As mentioned earlier, a major challenge when dealing with a large hyper-parameter space is that testing numerous configurations can quickly lead to a loss in statistical efficiency. We overcome this by focusing only on the "optimal" region of the hyper-parameter space.
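For a finite grid of configurations with precomputed empirical objectives, the Pareto optimal set can be recovered by a direct non-domination check. A brute-force sketch (all objectives are assumed to be "lower is better"; efficient solvers replace this for large spaces):

```python
import numpy as np

def pareto_optimal_set(points):
    """Return indices of the non-dominated rows of `points` (num_configs x r).
    A row is dominated if some other row is <= in every objective and
    < in at least one objective."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # A point never dominates itself: (pts[i] < p) is all-False for j == i.
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            keep.append(i)
    return keep
```

For example, with objectives (risk, cost), the point (2, 2) is dominated by (1, 2), while (1, 2) and (2, 1) are incomparable and both survive.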

5.2. ORDERING THE PARETO FRONTIER

We now define an ordering for the set of tests/configurations on the Pareto frontier along which to conduct FST. We take the simple, but empirically effective, strategy of ordering λ ∈ Λpar by their (combined) estimated p-values p^opt(λ, α) = max_{1≤i≤c} p(Q̂i^opt(λ); αi, m1), which we compute over Dopt (the same data used to recover the Pareto frontier, but separate from the testing data). Converting the c constrained dimensions to p-values and taking their maximum allows us to align and compare risks of different types (e.g., binary 0/1 error vs. precision/recall rates in [0, 1]), or risks that are controlled by different bounds (e.g., αi ≪ αj). Intuitively, because we focus on the Pareto optimal set, for each configuration along this ordering there is no other configuration with a lower estimated p-value that is also expected to be dominant on the k free objectives.

Note that for c > 1, we can (optionally) prune the frontier by considering only the subset Λ′par ⊆ Λpar that is optimal with respect to q̂^opt(λ, α) = [p^opt(λ, α), Q̂^opt_{c+1}(λ), ..., Q̂^opt_{c+k}(λ)]^T. In other words, since we only care about the maximum p-value over the constrained objectives, we can ignore configurations in Λpar that differ only along the constrained dimensions, without affecting the free dimensions or the combined p-value. After defining and ordering Λpar over Dopt, we then proceed with FST over Dtesting to identify a subset Λr ⊆ Λpar of configurations for which we can successfully reject Hλ (i.e., that are valid risk-controlling configurations). Finally, after obtaining the validated subset Λr, we again find and return the resulting Pareto frontier (now with respect to only the objectives of practical interest: {Qc+1, ..., Qc+k}). For k = 1, this consists of a single configuration λ*, while for k > 1, this consists of a set of non-dominated, valid configurations Λ*.

Algorithm 1 Pareto Testing
Definitions: f is a configurable model with n thresholds λ = (λ1, ..., λn). Dcal = Dopt ∪ Dtesting is a calibration set of size m, split into optimization and (statistical) testing sets of sizes m1 and m2, respectively. {Q1, ..., Qc+k} are objective functions. α = {α1, ..., αc} are user-specified risk bounds for the first c objectives. Λ is the configuration space. δ is the tolerance. PARETOOPTIMALSET returns the Pareto frontier, and can either be computed via a multi-objective optimization algorithm or exhaustive search (see Algorithm F.1).
1: function OPTIMIZATION(Dopt, Λ, α)
2:   Λpar ← PARETOOPTIMALSET(Λ, Q̂1^opt, ..., Q̂c+k^opt)
3:   p^opt(λ, α) ← max_{1≤i≤c} p(Q̂i^opt(λ); αi, m1), for all λ ∈ Λpar
4:   Λordered ← order configurations according to increasing p^opt(λ, α)
5:   return Λordered
6: function CALIBRATION(Dtesting, Λordered, α, δ)
7:   Q̂i^testing(λ) ← (1/m2) Σ_{(X,Y)∈Dtesting} qi(X, Y; λ), for all λ ∈ Λordered and 1 ≤ i ≤ c
8:   p^testing(λ, α) ← max_{1≤i≤c} p(Q̂i^testing(λ); αi, m2), for all λ ∈ Λordered
9:   Apply FST: Λr = {λ^(j) : j < J}, where J = min_j {j : p^testing(λ^(j), α) ≥ δ}
10:  Λ* ← PARETOOPTIMALSET(Λr, Q̂c+1^testing, ..., Q̂c+k^testing)
11:  return Λ*
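The two stages above can be put together in a minimal end-to-end sketch. Empirical objective values are assumed to be precomputed as arrays (one row per configuration), and the weaker Hoeffding p-value is used for simplicity in place of Hoeffding-Bentkus:

```python
import numpy as np

def hoeffding_pvalue(emp_risk, alpha, m):
    """Hoeffding p-value for H: Q(lam) > alpha (see Section 4)."""
    return float(np.exp(-2.0 * m * max(alpha - emp_risk, 0.0) ** 2))

def pareto_testing(q_opt, q_test, alphas, m1, m2, delta):
    """Sketch of Algorithm 1. q_opt / q_test: (num_configs, c+k) arrays of
    empirical objective values on D_opt / D_testing; the first c columns are
    the constrained risks. Returns indices of validated configurations."""
    q_opt, q_test = np.asarray(q_opt, float), np.asarray(q_test, float)
    c = len(alphas)
    # Stage 1 (on D_opt): Pareto frontier over all c+k objectives...
    frontier = [i for i in range(len(q_opt))
                if not any(np.all(q_opt[j] <= q_opt[i]) and np.any(q_opt[j] < q_opt[i])
                           for j in range(len(q_opt)))]
    # ...ordered by the combined (max over constrained risks) p-value.
    p_opt = {i: max(hoeffding_pvalue(q_opt[i, j], alphas[j], m1) for j in range(c))
             for i in frontier}
    ordered = sorted(frontier, key=p_opt.get)
    # Stage 2 (on D_testing): fixed sequence testing; stop at first non-rejection.
    rejected = []
    for i in ordered:
        p_test = max(hoeffding_pvalue(q_test[i, j], alphas[j], c and m2) for j in range(c))
        if p_test >= delta:
            break
        rejected.append(i)
    return rejected
```

The final Pareto filtering of the validated set over the k free objectives (line 10 of Algorithm 1) is omitted here for brevity; for k = 1 it amounts to picking the rejected configuration with the smallest free objective.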

5.3. APPLYING FIXED SEQUENTIAL TESTING ON THE PARETO FRONTIER

Our method is summarized in Algorithm 1 and illustrated in Figure 2. We call it Pareto Testing for two reasons: (i) the method reduces to applying FST over a path defined by the Pareto front of the multi-objective problem, and (ii) repeated testing for different α limitations yields a calibrated Pareto frontier with constraints on specific dimensions in the objective function space. It is straightforward to show that Pareto Testing achieves valid risk control, as we now formally state.

Proposition 5.1. Let Dcal = {(Xi, Yi)}, i = 1, ..., m, be a set of i.i.d. random variables split into two disjoint subsets, Dopt and Dtesting. Let p^testing(λ, α) be a valid p-value for a configuration λ, where P(p^testing(λ, α) ≤ u) ≤ u for all u ∈ [0, 1] over the draw of Dtesting. Then all configurations in the output set Λ* of Algorithm 1 are also simultaneously (α, δ)-risk controlling configurations.

The proof, given in Appendix A, follows from Split FST. Note that for k > 1, the chosen set Λ* contains configurations that are simultaneously valid. We are therefore free to use any λ*_i ∈ Λ*, as well as any randomly combined configuration in the convex hull of Λ*, defined as follows. Consider a randomized strategy where, for each test point, we sample the configuration λ*_j ∈ Λ* to use with probability ∆j, where ∆ lies in the (|Λ*| − 1)-dimensional probability simplex. The resulting combination is also (α, δ)-risk controlling, and allows for different (average) outcomes on the k free objectives.

Corollary 5.2. Any randomized time-shared use of configurations in Λ* is (α, δ)-risk controlling.
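Corollary 5.2 translates directly into a per-example sampling rule. A minimal sketch, where `valid_configs` is assumed to be the output Λ* of Algorithm 1 and `weights` a user-chosen point ∆ on the probability simplex:

```python
import random

def time_shared_config(valid_configs, weights, rng=random):
    """Randomized time-sharing over the validated set (Corollary 5.2): for each
    test point, sample one configuration lambda*_j with probability Delta_j.
    Since every element of the set is simultaneously (alpha, delta)-risk
    controlling, so is the sampled mixture."""
    return rng.choices(valid_configs, weights=weights, k=1)[0]
```

Varying the weights interpolates the average behavior on the free objectives (e.g., average cost) between the endpoints of Λ*, without affecting validity.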

6. ADAPTIVE MULTI-DIMENSIONAL PRUNING

We now turn to a concrete application of our method, in which we demonstrate its effectiveness for reliably accelerating Transformer models (Vaswani et al., 2017) . Here we pair each pruning "dimension" with a score function that estimates the relative importance of each prunable element in that dimension. By thresholding scores, we obtain different versions of the model with different performance. We assume a K-layer Transformer model with W attention heads per layer, and L(X) input tokens (see Vaswani et al. (2017) for a complete description of the Transformer model).

6.1. CONFIGURABLE DIMENSIONS

We consider the following adaptive behaviors (see also Appendix B for details):
1. Token pruning. We assign each token an (after-the-fact) importance score, based on the gradient of the output probability w.r.t. the token embedding. To determine token importance at run-time, we predict the score at each layer (Modarressi et al., 2022), then remove tokens with estimated scores below a threshold λtok, yielding a sequence of size Lj(X; λtok) at the j-th layer.
2. Early exiting. We attach a softmax classifier head to each layer, and exit whenever the predictive entropy of its output is below a threshold λlayer. The exit layer is denoted as Kexit(X; λlayer).
3. Head pruning. Similar to token pruning, we compute an importance score for each attention head per layer by the gradient of the loss over the validation set (from training) w.r.t. the head's hidden representation, following Michel et al. (2019). Wj(λhead) denotes the number of retained heads in layer j. Note that this is fixed for all inputs, unlike the previous mechanisms.
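As a concrete illustration of one of these mechanisms, the entropy-based early-exit rule can be sketched as follows, where `per_layer_probs` stands in for the per-layer softmax outputs of the attached classifier heads (an illustrative name, not from the actual implementation):

```python
import numpy as np

def entropy(probs):
    """Predictive (Shannon) entropy of a probability vector, in nats."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def early_exit_layer(per_layer_probs, lam_layer):
    """Return the index of the first layer whose classifier head is confident
    enough (predictive entropy below lam_layer); fall back to the last layer."""
    for j, probs in enumerate(per_layer_probs):
        if entropy(probs) < lam_layer:
            return j
    return len(per_layer_probs) - 1
```

Lowering λlayer demands more confidence before exiting, trading compute for accuracy; this is precisely the knob that our calibration procedure tunes jointly with λtok and λhead.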

6.2. OBJECTIVE FUNCTIONS

We also define several practical objective functions. f(·; λ0) denotes the full model without pruning.

Relative computational cost. We define the relative computational cost as the ratio between the computational cost of the pruned model and that of the full model:

Q_cost(λ) = E[ ( Σ_{j=1}^{Kexit(X; λlayer)} Wj(λhead) · Lj(X; λtok)^2 ) / ( Σ_{j=1}^{K} W · L(X)^2 ) ].   (5)

Eq. (5) reflects a simplistic definition that incorporates a quadratic dependency on the sequence length due to the attention mechanism, and a linear dependency on the number of attention heads. We also consider total FLOPs (floating-point operations) per forward pass.

Relative accuracy reduction. Speeding up the run-time of a model can also degrade its accuracy. Define the random variable D(X, Y; λ) = 1{f(X; λ0) = Y} − 1{f(X; λ) = Y}, which is 0 when both model predictions are the same, 1 when the full model is correct while the pruned model is incorrect, and −1 if the opposite is true. We define the relative accuracy reduction as:

Q_acc(λ) = E[D(X, Y; λ)] = E[1{f(X; λ0) = Y}] − E[1{f(X; λ) = Y}],   (6)

i.e., the difference in accuracy between the full and pruned models. In order to exploit p-values derived from confidence bounds that assume the risk is in [0, 1] (Angelopoulos et al., 2021), we define D′(X, Y; λ) = [D(X, Y; λ)]_+, which differs only in the rare event that the pruned model is correct while the full model is not, and is more restrictive since E[D(X, Y; λ)] ≤ E[D′(X, Y; λ)].

Worst-class relative accuracy reduction. In some cases, we would like to control the worst-class accuracy, or equivalently, require that the accuracy reduction of every class is controlled at the same level:

Q_acc-class(y; λ) = E[D′(X, Y; λ) | Y = y] ≤ α, ∀y ∈ Y.   (7)

Note that this adds an additional |Y| objectives (which can still be handled efficiently, see Appendix A).

Selective classification abstention rate.
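As a brief aside before the final objective: the empirical versions of the cost and accuracy objectives in Eqs. (5) and (6) can be computed directly from per-example execution traces. A hedged sketch, with illustrative argument names not taken from the paper's code:

```python
import numpy as np

def relative_cost(exit_layer, heads_per_layer, seq_lens, K, W, full_len):
    """Per-example version of Eq. (5): quadratic in sequence length (attention),
    linear in retained heads. `heads_per_layer` / `seq_lens` are per-layer values
    up to the exit layer for one input; `full_len` is L(X)."""
    pruned = sum(heads_per_layer[j] * seq_lens[j] ** 2 for j in range(exit_layer))
    full = K * W * full_len ** 2
    return pruned / full

def accuracy_reduction(full_correct, pruned_correct, clip=True):
    """Empirical Q_acc: mean of D = 1{full correct} - 1{pruned correct}.
    With clip=True, uses D' = max(D, 0) so the per-example loss lies in [0, 1]."""
    d = np.asarray(full_correct, dtype=float) - np.asarray(pruned_correct, dtype=float)
    if clip:
        d = np.maximum(d, 0.0)
    return float(d.mean())
```

Averaging `relative_cost` over a dataset gives the empirical Q_cost; grouping `accuracy_reduction` by label gives the per-class risks of Eq. (7).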
Consider a selective classification problem, where the model is allowed to abstain from making a prediction when it is unsure, based on some threshold τ on the model's confidence (we use the probability of the predicted class, max_y f(X, y; λ)). In this case, we re-define the relative accuracy and cost reductions to be conditioned on making a prediction. We also introduce an abstention rate objective (e.g., abstain from prediction at most 20% of the time):

Q_abstention-rate(λ, τ) = E[ 1{max_y f(X, y; λ) < τ} ].   (8)

7. EXPERIMENTS

Baselines and Evaluation. We present both risk-controlling and non-risk-controlling baselines. Non-risk-controlling baselines: (1) α-constrained, the solution to the constrained optimization problem in Eq. (21); (2) (α, δ)-constrained, the same as before, but with constraints defined over p-values, which is equivalent to testing without FWER control. Risk-controlling baselines: (3) 3D SGT, SGT defined over a 3D graph, see Algorithm F.3; (4) Split FST, the split method proposed in Angelopoulos et al. (2021), where a set of hypotheses is ordered by increasing estimated p-values. For fairness, each baseline (including our method) operates over the same predefined grid of configurations. We use 6480 configurations in total (18 head, 20 token, and 18 early-exit thresholds). Note that the recovered Pareto front in this case is restricted to points in this grid; see Algorithm F.1. We also show the results obtained while using a multi-objective optimizer, to demonstrate the actual computationally efficient implementation of our method (rather than brute-force exploration of the grid). We repeat each experiment over different splits of calibration and test data (50-100 runs in total), and report the mean over all splits (with 95% CIs) for the configurations selected by each method.

Two objectives (one controlled, one free). We start with a two-objective scenario, where we wish to control the accuracy reduction (Eq. (6)) while minimizing the cost (Eq. (5)).
The average accuracy reduction and relative cost are presented in Fig. 3 for the risk-controlling baselines. We observe that the proposed method obtains the lowest cost among the risk-controlling baselines for all α values and across all tasks. In particular, it can be seen that Split FST obtains slightly looser control of relative accuracy reduction, yet higher relative computational costs, compared to Pareto Testing. Ordering by p-values alone does not take into account scenarios where several configurations have similar accuracy but vary in cost, while the proposed method optimizes the selection and ordering of configurations in both dimensions. We also see that 3D SGT performs well for low α values, but often becomes worse as α increases. A possible factor is that as α increases, 3D testing is allowed to explore more of the 3D graph, but does so inefficiently - leading to overall lower rejection rates.

Figure 4 shows the difference between the risk-controlling and the non-risk-controlling baselines in terms of satisfying Definition 3.1. For the non-risk-controlling baselines (left), the risk exceeds α more frequently than the allowed tolerance level δ = 0.1. By contrast, and as expected, all the risk-controlling baselines (right) are always below the tolerance level.

Three objectives (two controlled, one free). We study a scenario with three objectives on MNLI, where we control both the average accuracy (Eq. (6)) and the worst-class accuracy (Eq. (7)) while minimizing cost (Eq. (5)). We vary the values of α1 for average accuracy and set α2 = 0.15 for worst-class accuracy. Figure 5 reports the results for the three objective functions. It can be seen that when α1 is small, testing is dominated by average accuracy (worst-class accuracy is not tight), and as α1 increases, worst-class accuracy becomes dominant and average accuracy becomes loosely controlled. Here too, we see that Pareto Testing obtains improved cost reduction with respect to the other baselines.
Results with an off-the-shelf optimizer. For the first scenario (accuracy control with cost minimization), Figure D.1 compares the proposed method using a grid (blue) against a multi-objective optimizer (red to yellow) with different numbers of function evaluations. For the multi-objective optimizer, we used an implementation (Lindauer et al., 2022) of the ParEGO algorithm (Knowles, 2006; Cristescu & Knowles, 2015). We observe that even with a small number of evaluations (e.g., 50), we obtain reasonable results, which further improve as the number of evaluations increases. The grid option performs better for certain α values, but it requires significantly more function evaluations. A more in-depth analysis of how the multi-objective optimization method and the allowed number of evaluations influence testing efficiency is left for future work.

Additional results. We briefly highlight a number of results contained in Appendix D. For the same "accuracy control" setting, we report FLOPs saved as an alternative measure of cost improvement. In addition, we show flipped results for controlling cost while minimizing the relative loss in accuracy. We also explore a selective prediction setting with three objectives, where one is controlled and two are free. Specifically, we control the selective accuracy loss (Eq. (17)), while minimizing both the selective cost (Eq. (16)) and the abstention rate (Eq. (8)). Figure D.2 reports the cost and abstention rate for the configurations chosen by either the proposed method or Split FST. Pareto Testing selects a richer set of configurations that offers better cost-coverage trade-offs.
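For illustration, recovering a Pareto front from a finite grid reduces to non-dominated filtering. The following generic sketch (Algorithm F.1 itself is not reproduced here, and the points are made up) keeps configurations that no other configuration weakly dominates in both objectives:

```python
# Sketch of recovering a Pareto front from a finite grid of configurations:
# keep configurations not dominated in (risk, cost), both to be minimized.
# The points are illustrative (accuracy-reduction risk, relative cost).

points = [(0.02, 0.9), (0.05, 0.6), (0.04, 0.7), (0.06, 0.8), (0.10, 0.4)]

def pareto_front(points):
    """Return the points for which no distinct point is <= in both
    coordinates (minimization in both objectives)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

print(pareto_front(points))  # (0.06, 0.8) is dominated by (0.05, 0.6)
```

Sorting the surviving points by increasing risk yields exactly the low-to-high ordering along the front that the testing stage traverses.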

8. CONCLUSION

Deployment of machine learning models in the real world can frequently demand precise guarantees that certain constraints will be satisfied, together with good empirical performance on other objectives of interest. In this work, we presented Pareto Testing, a two-stage procedure for multiple risk control combined with multi-objective optimization. In the first stage, Pareto Testing relaxes all constraints, and converts the problem to a standard multi-objective optimization format that can be efficiently solved with off-the-shelf optimizers to yield a Pareto frontier of hyper-parameter configurations affecting model performance. In the second stage, this Pareto frontier is filtered via multiple hypothesis testing to identify configurations that simultaneously satisfy the desired risk constraints with (specifiably) high probability-while also being effective solutions for the free objectives. Bridging theory and practice, we demonstrated the effectiveness of our method for reliable and efficient adaptive computation of Transformer models on several text classification tasks under various conditions. 

A MATHEMATICAL DETAILS

We present the proofs for our theoretical claims.

A.1 MAX P-VALUE FOR MULTIPLE RISKS

First, we re-state and prove that taking the maximum p-value also yields a valid p-value.

Lemma A.1. Let p_i(λ, α_i) be a p-value for H_{λ,i} : Q_i(λ) > α_i, for each i ∈ {1, . . . , c}. Define p(λ, α) := max_{1≤i≤c} p_i(λ, α_i). Then, for all λ such that H_λ : ∃i with Q_i(λ) > α_i holds, we have:

P(p(λ, α) ≤ u) ≤ u.    (9)

Proof. Let I ⊆ {1, . . . , c} be the set of indices of true null hypotheses at λ. We have:

P(p(λ, α) ≤ u) ≤ P(max_{i∈I} p_i(λ, α_i) ≤ u) = P(∩_{i∈I} {p_i(λ, α_i) ≤ u}) ≤ max_{i∈I} P(p_i(λ, α_i) ≤ u) ≤ u,

where the first inequality holds because the maximum over I is at most the maximum over all i, and the last holds since P(p_i(λ, α_i) ≤ u) ≤ u for each true null i ∈ I.

A.2 PROOF OF PROPOSITION 5.1

Proof. We prove simultaneous (α, δ)-control over Λ_r. Let H_{λ'} be the first true null hypothesis in the testing sequence. Since p(λ', α) is a super-uniform p-value under H_{λ'}, the probability of making a false discovery at λ' is bounded by δ. Moreover, if H_{λ'} fails to be rejected (no false discovery), then all hypotheses H_λ that follow it in the sequence also fail to be rejected (regardless of whether each such H_λ is true or not). Hence the probability of making any false discovery is bounded by δ. This implies that, with probability at least 1 − δ, all configurations in Λ* ⊆ Λ_r are risk-controlling, which in turn implies that any configuration in Λ* is (α, δ)-risk-controlling.

A.3 PROOF OF PROPOSITION 5.2

Proof. We restate our randomized time-sharing strategy: for each test point (X, Y), we independently sample the configuration λ*_j ∈ Λ* to use with probability ∆_j, where ∆ ∈ S_{|Λ*|} is a point in the (|Λ*| − 1)-dimensional probability simplex S_{|Λ*|} = {∆ ∈ R^{|Λ*|} : Σ_j ∆_j = 1, ∆_j ≥ 0}. Given D_cal, Λ* is a fixed set, and each risk Q_i(λ*_j) = E[q_i(X, Y; λ*_j)] is a constant, for all λ*_j ∈ Λ* and i ∈ {1, . . . , c}. For each Q_i, the combined risk of the time-sharing strategy is the mean of a mixture model:

Q_i(λ_share) = Σ_{j=1}^{|Λ*|} ∆_j Q_i(λ*_j) ≤ Σ_{j=1}^{|Λ*|} ∆_j max_{j'} Q_i(λ*_{j'}) = max_{j'} Q_i(λ*_{j'}).
Let E be the event that all λ*_j ∈ Λ* are risk-controlling across all Q_i at levels α_i, given the draw of D_cal. Proposition 5.1 gives that this event occurs with probability at least 1 − δ. Therefore,

P(Q_i(λ_share) ≤ α_i) ≥ P(max_{j'} Q_i(λ*_{j'}) ≤ α_i) ≥ 1 − δ, ∀i ∈ {1, . . . , c} simultaneously.

Thus λ_share is also (α, δ)-risk-controlling for any choice of ∆.
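The mixture bound in the proof can be checked numerically. In this sketch the risks and simplex weights are illustrative placeholders, and `sample_config` is a hypothetical helper:

```python
import random

# Sketch of the randomized time-sharing strategy (Proposition 5.2): for each
# test point, a validated configuration lambda*_j is drawn with probability
# Delta_j. Risks and weights below are illustrative placeholders.

risks = [0.04, 0.08, 0.10]   # Q_i(lambda*_j) for one controlled risk i
delta = [0.2, 0.5, 0.3]      # a point on the probability simplex

# The combined risk is the Delta-weighted mixture, bounded by the max risk.
q_share = sum(d * q for d, q in zip(delta, risks))
assert q_share <= max(risks)  # mixture never exceeds the worst component

def sample_config(delta, rng=random):
    """Draw a configuration index j with probability delta[j]."""
    return rng.choices(range(len(delta)), weights=delta, k=1)[0]

print(q_share)
```

Since every component risk is at most max_{j'} Q_i(λ*_{j'}), any convex combination is too, which is all the proof needs.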

A.4 HOEFFDING-BENTKUS INEQUALITY P-VALUES

The Hoeffding-Bentkus p-value from Bates et al. (2021) combines the Hoeffding and Bentkus inequalities:

p^HB(Q̂_cal(λ); α, m) = min( exp{−m h_1(Q̂_cal(λ) ∧ α, α)}, e · P(Binom(m, α) ≤ ⌈m Q̂_cal(λ)⌉) ),    (13)

where h_1(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)).
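Eq. (13) can be transcribed directly. The following pure-Python sketch computes the binomial CDF exactly from its pmf, which is adequate for moderate m (a production implementation would likely use a numerically robust CDF, e.g. from scipy):

```python
import math

def binom_cdf(k, n, p):
    """P(Binom(n, p) <= k), computed directly from the pmf."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def h1(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def hb_p_value(risk_hat, alpha, m):
    """Hoeffding-Bentkus p-value of Eq. (13) for H: Q(lambda) > alpha,
    given the empirical risk `risk_hat` over m calibration points."""
    hoeffding = math.exp(-m * h1(min(risk_hat, alpha), alpha))
    bentkus = math.e * binom_cdf(math.ceil(m * risk_hat), m, alpha)
    return min(hoeffding, bentkus, 1.0)

# Empirical risk well below alpha -> tiny p-value (strong evidence to reject H).
print(hb_p_value(0.02, 0.1, 500))
```

When the empirical risk reaches α the Hoeffding term equals 1, so the p-value is 1 and the hypothesis cannot be rejected, as expected.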

A.5 WORST-CLASS OBJECTIVE

The empirical risk for the class-conditioned objective E[D′(X, Y; λ) | Y = y] is computed over the samples in class y:

Q̂_acc-class(y, λ) = (1/|D^y_cal|) Σ_{(X,Y)∈D^y_cal} D′(X, Y; λ),  where D^y_cal = {(X, Y) ∈ D_cal | Y = y}.
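A minimal sketch of the per-class estimates and the worst-class reduction (labels and loss values are illustrative):

```python
# Sketch: per-class empirical risks and the worst-class objective.
# `labels` are class ids; `loss` holds per-example accuracy-reduction
# values D'(X, Y; lambda) (illustrative numbers).

labels = [0, 0, 1, 1, 1, 2, 2]
loss   = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

def per_class_risk(labels, loss):
    """Empirical risk of each class, averaged over that class's examples."""
    risks = {}
    for y in set(labels):
        vals = [l for yl, l in zip(labels, loss) if yl == y]
        risks[y] = sum(vals) / len(vals)
    return risks

risks = per_class_risk(labels, loss)
worst = max(risks.values())   # single equivalent worst-class objective
print(risks, worst)
```

Taking the maximum over the per-class averages is exactly the reduction to a single worst-class objective used in Appendix A.5.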

A.6 SELECTIVE CLASSIFICATION

Similarly to the above derivation for per-class accuracy, for selective classification we define the selective cost (given selection):

Q_select-cost(λ, τ) = E[q_cost(X; λ) | max_y f(X, y; λ) ≥ τ]    (16)

and likewise the selective accuracy reduction:

Q_select-acc(λ, τ) = E[q_acc(X; λ) | max_y f(X, y; λ) ≥ τ],    (17)

which are estimated empirically by:

Q̂_select-cost(λ, τ) = (1/|D^τ_cal|) Σ_{X∈D^τ_cal} q_cost(X; λ),   Q̂_select-acc(λ, τ) = (1/|D^τ_cal|) Σ_{X∈D^τ_cal} q_acc(X; λ),

where D^τ_cal = {X ∈ D_cal | max_y f(X, y; λ) ≥ τ}.

B MULTI-DIMENSIONAL PRUNING

First, we describe the core model units and introduce some essential notation, while keeping the exact model implementation as general as possible. Second, we describe each of the pruning dimensions, its associated importance score, and the thresholding mechanism.

B.1 TRANSFORMER MODEL

Consider a Transformer model (Vaswani et al., 2017; Devlin et al., 2018) with K layers. The input to the model is a sequence of L tokens x = (x_1, . . . , x_L) (for notational simplicity we omit the dependency of the length on x), which are first mapped to learnable word embeddings e = (e_1, . . . , e_L). Tokens are then passed through the model's layers, with h_j = (h_{j,1}, . . . , h_{j,L}) denoting the j-th layer's hidden representations. Each layer consists of multi-head attention, with W heads producing the combined output a_j = Σ_{w=1}^{W} Attn_{j,w}(h_{j−1}), followed by a feed-forward network that produces the next layer's hidden representation. The last layer is attached to a classification head with |Y| outputs, where f(x, y) denotes the output for class y. The model is optimized by minimizing a loss function L computed empirically over the training set.

B.2 TOKEN PRUNING

It is often the case that the input contains a large number of tokens with negligible contribution to the prediction task. The idea in token pruning is to identify unimportant tokens and discard them at some point in the model. To quantify the contribution of each token, we attach to each layer a token importance predictor s^tok_j : X → R based on the token hidden representation h_{j,l}. Following Modarressi et al. (2022), we use gradient attributions as importance scores, computed by:

r_l = ‖(∂f(x, y_c)/∂e_l) ⊙ e_l‖_2,

where y_c is the true label and ⊙ denotes the element-wise product. The token importance predictors in each layer are optimized with a cross-entropy loss, where the labels are the scores, normalized to sum to one. In the j-th layer, tokens with s^tok_j(x_l) < λ_tok are pruned and are not passed to the next layer. The number of tokens remaining after layer j is given by:

L_j(x; λ_tok) = Σ_{l=1}^{L} Π_{j'=1}^{j} 1[s^tok_{j'}(x_l) > λ_tok].
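The cascaded survival count L_j can be sketched as follows; the scores are illustrative and `tokens_remaining` is a hypothetical helper name:

```python
# Sketch of cascaded token pruning: a token survives through layer j only if
# its importance score exceeded lambda_tok at every layer up to j.
# `scores[j][l]` holds s_tok_j(x_l) (illustrative values).

def tokens_remaining(scores, lam_tok, j):
    """L_j(x; lambda_tok): number of tokens still alive after layer j."""
    n_tokens = len(scores[0])
    return sum(
        all(scores[jp][l] > lam_tok for jp in range(j + 1))
        for l in range(n_tokens)
    )

scores = [
    [0.9, 0.2, 0.7, 0.6],   # layer 0 importance per token
    [0.8, 0.1, 0.3, 0.7],   # layer 1
]
print(tokens_remaining(scores, 0.5, 0))  # 3 tokens survive layer 0
print(tokens_remaining(scores, 0.5, 1))  # 2 tokens survive both layers
```

The inner product over layers mirrors the product of indicators in the formula: pruning is monotone, so the token set can only shrink with depth.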

B.3 EARLY EXITING

Early exiting is based on the idea that examples vary in their difficulty, and hence require different amounts of computation to reach a good prediction. While for simple examples a decision can be made early on, difficult examples may require going through the full model. We attach a prediction head f_j : X → Y to each layer, trained to predict the labels via the same loss function as the original model. Following Liu et al. (2020), we define the importance score as the prediction head's entropy:

s^layer_j(x) = −Σ_{y∈Y} p_j(y|x) log p_j(y|x),

where p_j(y|x) are the per-class probabilities provided by the j-th prediction head. Based on this score, examples with s^layer_j(x) < λ_layer exit at the j-th layer. The exit layer of x is given by:

K_exit(x; λ_layer) = min{ j ∈ {1, . . . , K} : s^layer_j(x) < λ_layer }.
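A sketch of the exit rule, with made-up prediction-head probabilities:

```python
import math

# Sketch of entropy-based early exiting: exit at the first layer whose
# prediction-head entropy falls below lambda_layer.

def entropy(probs):
    """Shannon entropy of a per-class probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def exit_layer(per_layer_probs, lam_layer):
    """K_exit: index (1-based) of the first layer with entropy < lambda_layer;
    falls back to the last layer if no head is confident enough."""
    for j, probs in enumerate(per_layer_probs, start=1):
        if entropy(probs) < lam_layer:
            return j
    return len(per_layer_probs)

layers = [
    [0.4, 0.3, 0.3],     # early layer: high entropy, keep computing
    [0.7, 0.2, 0.1],     # more confident
    [0.97, 0.02, 0.01],  # very confident: exit here
]
print(exit_layer(layers, 0.3))
```

Lowering λ_layer demands more confidence before exiting, trading extra computation for accuracy, which is exactly the knob exposed to the testing stage.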

B.4 HEAD PRUNING

It was shown by Michel et al. (2019) that a significant fraction of attention heads can be removed with little impact on performance. Each attention head (w, j), 1 ≤ j ≤ K, 1 ≤ w ≤ W, is assigned a score:

s^head_j(w) = Attn_{j,w}(h_{j−1})^T ∂L/∂Attn_{j,w}(h_{j−1}).

The scores in each layer are normalized to sum to one. Attention heads with s^head_j(w) < λ_head are pruned, and the number of heads left after pruning is given by:

W_j(λ_head) = Σ_{w=1}^{W} 1[s^head_j(w) > λ_head].

Note that this pruning is fixed, unlike the previous pruning dimensions, which vary according to the input x.
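The per-layer head count W_j reduces to simple thresholding; the scores below are illustrative (already normalized per layer):

```python
# Sketch of fixed head pruning: heads with normalized importance below
# lambda_head are removed; the count per layer is input-independent.
# `head_scores[j][w]` holds s_head_j(w) (illustrative values).

def heads_remaining(head_scores, lam_head):
    """W_j(lambda_head) for each layer j."""
    return [sum(s > lam_head for s in layer) for layer in head_scores]

head_scores = [
    [0.50, 0.30, 0.15, 0.05],
    [0.40, 0.35, 0.20, 0.05],
]
print(heads_remaining(head_scores, 0.10))  # [3, 3]
```

Because the scores do not depend on x, this dimension is pruned once per configuration, unlike token pruning and early exiting.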

C IMPLEMENTATION AND DATASET DETAILS

Datasets. Splitting specifications and full-model performance on each task are contained in Table C.1. Note that for IMDB, QQP, and MNLI we used a subset of the original dev/test set in order to expedite evaluation. For MNLI we used the split of Sagawa et al. (2019).

Prediction heads. Each prediction head is a 2-layer feed-forward neural network with 32-dimensional hidden states and ReLU activations. The input is the hidden representation of the [CLS] token concatenated with the hidden representations from all previous layers, as proposed in Wołczyk et al. (2021).

Token importance predictors. Each token importance predictor is a 2-layer feed-forward neural network with 32-dimensional hidden states and ReLU activations. The input is the hidden representation of each token in the current layer and all previous layers, following Wołczyk et al. (2021).

Training. The core model is first finetuned on each task. We compute the attention-head importance scores on validation data. We then freeze the backbone model and train the early-exit classifiers and the token importance predictors on the training data.

Code. Our code will be made available at https://github.com/bracha-laufer/pareto-testing.

D ADDITIONAL BASELINES AND RESULTS

Our method is based on the Learn then Test (LTT) framework (Angelopoulos et al., 2021), summarized in Algorithm F.2. We compare our method to two baselines from Angelopoulos et al. (2021): 3D-SGT, summarized in Algorithm F.3, which is a 3D extension of the 2D Hamming SGT, and Split-FST, described in Appendix E. Note that we consider a broader setting in which, besides multiple risk control, we wish to optimize additional free objective functions. In addition, we define two non-risk-controlling baselines:

α-constrained. A constrained optimization problem can be defined as:

min_{λ∈Λ} q̂^cal_k(λ)  s.t.  Q̂^cal_i(λ) < α_i, ∀1 ≤ i ≤ c,    (21)

where q̂^cal_k(λ) = [Q̂^cal_{c+1}(λ), . . . , Q̂^cal_{c+k}(λ)]^T. Directly solving Eq. (21) over the calibration data, however, would not necessarily yield a generalizable λ̂ with the desired 1 − δ probability. In other words, the true risk Q_i(λ̂) over test data might exceed α_i, possibly with high probability.

(α, δ)-constrained. We simultaneously test all possible configurations at error level δ without correcting for multiple hypothesis testing.

Moreover, we develop two additional baselines and present their results here:

Low-Risk Path. Similar to Split-FST (and Pareto Testing), this is a dual-stage method, assuming the calibration data is split into two subsets. In the first stage, we find a solution to the constrained optimization problem defined in Eq. (21). A low-risk path is then defined from the full model to the solution. The path is defined over the grid of hyper-parameter combinations, where at each step we pick the neighbouring hyper-parameter combination (increasing one hyper-parameter dimension with respect to the previous step) with the lowest risk among all neighbours. The method is summarized in Algorithm F.4. Note that since the method defines the path in the hyper-parameter space, it implicitly assumes that the objective functions are monotonic with respect to each of the hyper-parameters.
Constrained-Path Testing. This can be considered a variant of the proposed method. When interested in selecting a single configuration for specific α and δ, a cheaper (but not equivalent) approach would be to solve multiple constrained problems:

min_{λ∈Λ} q̂^opt_k(λ)  s.t.  Q̂^opt_i(λ) < α_i − ϵ, ∀1 ≤ i ≤ c,    (22)

for a sequence of ϵ values in [0, min_i α_i]. An ordered set of configurations to test is then defined by the solutions to Eq. (22) with decreasing values of ϵ. Note that the constrained and the full multi-objective variants are equivalent in the case of a single control constraint. However, when there are multiple constraints, Pareto Testing operates on a larger set of hyper-parameter combinations, consisting of solutions to:

min_{λ∈Λ} q̂^opt_k(λ)  s.t.  Q̂^opt_i(λ) < α_i − ϵ_i, ∀1 ≤ i ≤ c,    (23)

with ϵ_i values in [0, α_i]; namely, solving the constrained problem for all possible combinations of (ϵ_1, . . . , ϵ_c).

Pruning model. Figure D.3 shows the accuracy and the relative cost of the proposed adaptive pruning model f for various threshold combinations, computed over test data. We see that fusing all pruning dimensions yields a wide variety of configurations with clear trade-offs between accuracy and cost. In addition, the Pareto front consists of different threshold combinations, indicating that the optimal threshold value in each dimension is not fixed but varies with the desired cost/accuracy level.

Two objectives - accuracy controlled, cost minimized (additional results). Results for additional baselines are shown in Fig. D.4, including the two non-controlling baselines, α-constrained and (α, δ)-constrained, and our derived Low-Risk Path baseline. We see that, in many cases, Low-Risk Path obtains cost reductions similar to our proposed method; however, it is inferior for certain tasks and α values. As expected, both non-risk-controlling baselines obtain lower costs than our method.
This, of course, comes at the price of risk violations exceeding δ = 0.1, as can be seen in the bottom-row bar plots. However, we see that there is not a large difference between the cost reductions obtained by our method and the non-controlling baselines. This indicates that our method, though providing risk-control guarantees (as opposed to the non-controlling baselines), is not overly conservative, and effectively optimizes the free objective function, leading to significant cost reductions. We further examine different aspects of the proposed method. The size of the Pareto optimal set obtained for each task is shown in Fig. D.5. We see that the size of the Pareto set varies among tasks and is around 3-5% of the original set (6480 configurations for each task), a significant reduction.

Two objectives - cost controlled, relative accuracy loss minimized. We evaluated the opposite scenario, where the relative cost is controlled while the accuracy reduction is minimized. Note that for 3D-SGT and Low-Risk Path we start testing from the empty model (lowest cost risk) towards the full model. The accuracy reductions of all methods are summarized in Table D.1. We observe that, as opposed to the accuracy-controlled setting, here the results of all methods are similar (except for Low-Risk Path).

Two objectives - accuracy controlled, FLOPs minimized. We experimented with a different cost measure in terms of FLOPs speed-up, which can be considered a more practical measure, tailored to the specific architecture being used. Results are summarized in Table D.2. We see that the proposed method almost always obtains the best speed-ups, in line with our other results.

Three objectives - accuracy/worst accuracy controlled, cost minimized. Results for additional baselines are shown in Fig. D.9. Here too, our method performs best. In this scenario, Constrained-Path Testing obtains similar results, while Low-Risk Path is significantly worse.
Three objectives - accuracy and abstention rate controlled, cost minimized. We use the same selective-classification setup described in §7. Note that combining τ with the other three pruning dimensions yields a four-dimensional hyper-parameter space with a complex interplay between the hyper-parameters and the risk functions. As τ increases, we expect better accuracy-cost trade-offs, since difficult examples are removed. In addition, the abstention rate is monotonic with respect to τ but is also influenced by the pruning dimensions in an uncharacterized manner. Here we control both the accuracy reduction and the abstention rate, while minimizing cost. Since the risk functions are not necessarily monotonic with respect to all hyper-parameters, Low-Risk Path is not applicable. Moreover, since we have a 4D hyper-parameter space, 3D SGT cannot be applied. Results are summarized in Fig. D.10, where all methods obtain similar results.

E FAMILY-WISE ERROR RATE CONTROL

Let Λ_g denote a set of possible configurations to test, and Λ_r ⊆ Λ_g the set of rejected hypotheses. When performing multiple hypothesis testing (MHT), a FWER-controlling procedure controls the probability of making one or more false discoveries, i.e., of falsely rejecting at least one true null hypothesis:

P(|Λ_r ∩ Λ_0| ≥ 1) ≤ δ,

where Λ_0 ⊆ Λ_g is the set of configurations for which the null hypothesis is true.

Split Fixed Sequence Testing. Proposed in Angelopoulos et al. (2021), Split-FST can be used when there is no clear structural relationship between the hypotheses from which to define a graph for SGT. The core idea is to split the calibration data into two subsets, where the first split is used to learn an ordering, while the other is used for testing. Specifically, a sequence of values β ranging from 0 to 1 is defined; for each β, the hypothesis whose p-values across all risks (computed over the first split) are closest (in vector infinity norm) to β is selected next. Based on this ordering, FST is then performed over the second split.

The generic LTT calibration (Algorithm F.2) recovers a subset of thresholds Λ_r ⊆ Λ_g for which the null hypothesis is rejected by applying a FWER-controlling procedure:

2: for λ ∈ Λ_g do
3:   compute Q̂^cal_i(λ) = (1/m) Σ_{(X,Y)∈D_cal} q_i(X, Y; λ), for i = 1, . . . , c
4:   compute the p-value p_cal(λ, α) = max_{1≤i≤c} p(Q̂^cal_i(λ); α_i, m)
5: Λ_r ← SGT(W, p_cal(λ, α))

Algorithm F.3 operates over a grid of I × J × K thresholds Λ_g = {λ_1^1, . . . , λ_1^I} × {λ_2^1, . . . , λ_2^J} × {λ_3^1, . . . , λ_3^K}, with λ^{i,j,k} = (λ_1^i, λ_2^j, λ_3^k), a 3D graph W with I × J × K nodes and weights W_{(i',j',k')→(i,j,k)} determining the error propagation, and an I × J × K matrix A holding an initial error budget for each configuration, satisfying Σ_{i,j,k} A_{i,j,k} = δ. After each rejection, the error budget is propagated and the next candidate is selected:

A_{i,j,k} ← A_{i,j,k} + A_{i*,j*,k*} W_{(i*,j*,k*)→(i,j,k)}, ∀(i, j, k) ∈ I;
(i*, j*, k*) ← argmin_{(i,j,k)∈I} p_cal(λ^{i,j,k}, α)/A_{i,j,k};
return Λ_r.

Algorithm F.4 Shortest-Path Testing
Definitions: configurable model f adapted by n thresholds λ = (λ_1, . . . , λ_n); a calibration set D_cal = D_opt ∪ D_testing of size m, split into optimization and (statistical) testing sets of sizes m_1 and m_2, respectively; objective functions Q_1, . . . , Q_c; user-specified control limits α = (α_1, . . . , α_c) and tolerance level δ; hyper-parameter resolution γ = (γ_1, . . . , γ_n), where γ_j is the resolution in the j-th dimension; λ_min consists of the minimum values of all thresholds, and λ_max of the maximum values.

3: Apply constrained optimization to find the optimal configuration λ_opt.
4: Λ_opt ← CREATEPATH(λ_min, λ_opt, γ) ∪ CREATEPATH(λ_opt, λ_max, γ)
   i ← i + 1
   return λ^(0), . . . , λ^(i)
21: function CALIBRATION(D_testing, Λ_opt, α, δ) ▷ Same as in Algorithm 1



If we fail to find any valid configurations (which may not exist) with the right confidence, then we abstain. This is a slight abuse of terminology in that, technically, (λ̂_1, . . . , λ̂_n) as a random variable is not necessarily a configuration that achieves risk control; rather, its realizations are valid with the appropriate probability. This is analogous to using the average set size to compare conformal predictors (Vovk et al., 2005; 2016).



Figure 2: Pareto Testing with two objectives. Q 1 is controlled at α while Q 2 is minimized. FST is applied along the sequence of configurations on the Pareto front, ordered from low to high expected risk w.r.t. Q 1 .

Figure 3: Two objectives (100 random splits). Relative accuracy reduction is controlled, while computational cost is minimized. Top plots show accuracy differences; bottom plots show relative cost.

7. EXPERIMENTS

Experimental setup. We test our method on five text classification tasks of varied difficulty: IMDB (Maas et al., 2011), AG News (Zhang et al., 2015), QNLI (Rajpurkar et al., 2016), QQP, and MNLI (Williams et al., 2018). We use a BERT-base model (Devlin et al., 2018) with K = 12 layers and W = 12 heads per layer, and attach a prediction head and a token importance predictor to each layer.

Figure 4: Two-objectives, QNLI (100 splits). Acc. reduction is controlled, cost is minimized. Left: histogram of acc. reduction, α = 0.05; middle: violin plots of acc. reduction; right: risk violations.

Figure 5: Three objectives, MNLI (100 random splits): average accuracy is controlled by α_1 ∈ {0.025, 0.05, . . . , 0.2}, worst accuracy is controlled by α_2 = 0.15, δ = 0.1, and cost is minimized.

Note that the per-class risks are of the same type and are bounded by the same α. If, furthermore, the classes are approximately balanced, i.e., |D^y_cal| ≈ |D^ỹ_cal| for all y, ỹ ∈ Y, then due to the monotonicity of the p-value with respect to the empirical risk, we have:

p(λ, α) = max_{y∈Y} p(Q̂_acc-class(y, λ); α, |D^y_cal|) ≈ p(max_{y∈Y} Q̂_acc-class(y, λ); α, |D^y_cal|).    (15)

Therefore, instead of |Y| objective functions, one per class, we can define a single equivalent empirical objective Q̂_acc-worst(λ) = max_{y∈Y} Q̂_acc-class(y, λ).

Figure D.1: Results of Pareto Testing on AG News with a multi-objective optimizer for different numbers of evaluations, and with a grid of thresholds. Results are averaged over 50 random splits. Accuracy reduction is controlled and cost is minimized.

Figure D.2: Three-objectives, AG News -one controlled, two free: accuracy reduction is controlled by α = 0.05, δ = 0.1, cost and abstention rate are minimized. The coloring is according to accuracy reduction.

Figure D.3: Cost and accuracy trade-offs over test data provided by a grid of 6480 configurations (18 head, 20 token, and 18 early-exit thresholds).

Figure D.4: Additional results for two objectives - accuracy reduction is controlled by α ∈ {0.025, 0.05, . . . , 0.2}, δ = 0.1, cost is minimized. Top row: accuracy reduction; middle row: relative cost; last row: rate of risk violations. Results are averaged over 100 random splits into calibration and test.

Figure D.5: Additional results for two-objectives: accuracy reduction and cost. Sizes of the Pareto set for each task.

Figure D.6: Additional results for two objectives - accuracy reduction is controlled by α ∈ {0.025, 0.075, 0.125, 0.2}, δ = 0.1, cost is minimized. Comparison of the sizes of the validated configuration sets for Pareto Testing and Split FST (computed over 20 random trials).

Figure D.7: Additional results for two-objectives -accuracy reduction is controlled by α ∈ {0.05, 0.1, 0.15}, δ = 0.1, cost is minimized. Top: violin plots for accuracy reduction; bottom: violin plots for relative cost (computed over 100 random splits).

Figure D.9: Three-objective scenario, MNLI - two controlled, one free: average accuracy is controlled by α_1 ∈ {0.025, 0.05, . . . , 0.2}, worst accuracy is controlled by α_2 = 0.15, δ = 0.1, and cost is minimized without control. Results are averaged over 100 random splits into calibration and test.

Figure D.10: Three-objective scenario, AG News - two controlled, one free: average accuracy is controlled by α_1 ∈ {0.025, 0.05, . . . , 0.2}, abstention rate is controlled by α_2 = 0.1, δ = 0.1, and cost is minimized. Results are averaged over 100 random splits into calibration and test.

Algorithm F.3 3D Graph Testing

Definitions: configurable model f adapted by n thresholds λ = (λ_1, . . . , λ_n), calibration data D_cal of size m, objective functions Q_1, . . . , Q_c, user-specified control limits α = (α_1, . . . , α_c) and tolerance level δ, and a grid of I × J × K thresholds Λ_g = {λ_1^1, . . . , λ_1^I} × {λ_2^1, . . . , λ_2^J} × {λ_3^1, . . . , λ_3^K}, with λ^{i,j,k} = (λ_1^i, λ_2^j, λ_3^k).

1: function CALIBRATE(D_cal, α, A, W)
2:   SGT(W, p_cal(λ, α)):
8:   I ← {1, . . . , I} × {1, . . . , J} × {1, . . . , K}
9:   Λ_r ← {}
10:  (i*, j*, k*) ← argmin_{(i,j,k)∈I} p_cal(λ^{i,j,k}, α)/A_{i,j,k}
11:  while |I| ≥ 1 do
12:    if p_cal(λ^{i*,j*,k*}, α) < A_{i*,j*,k*} then
13:      Λ_r ← Λ_r ∪ {λ^{i*,j*,k*}}
14:      I ← I \ (i*, j*, k*)
15:      A_{i,j,k} ← A_{i,j,k} + A_{i*,j*,k*} W_{(i*,j*,k*)→(i,j,k)}, ∀(i, j, k) ∈ I
16:      (i*, j*, k*) ← argmin_{(i,j,k)∈I} p_cal(λ^{i,j,k}, α)/A_{i,j,k}
     return Λ_r

1: function OPTIMIZATION(D_opt, α)
2:   Define the constrained problem (21).

function CREATEPATH(λ_start, λ_end, γ)
17:  if p_opt(λ_next, α) < p_min then
18:    λ^(i) ← λ_next
19:    p_min ← p_opt(λ_next, α)

Multiple hypothesis testing. A key component of LTT is the choice of FWER-controlling procedure. As the number of tested hypotheses H_λ grows (e.g., for combinatorially many λ), it becomes harder to reject H_λ while limiting the probability of false discoveries to δ. Different FWER-controlling procedures have different statistical efficiency/power (i.e., ability to correctly reject H_λ when it is false). Angelopoulos et al. (2021) consider a number of FWER-controlling procedures, namely the Bonferroni correction, Fixed Sequence Testing (FST), and Sequential Graphical Testing (SGT); see Appendix E for a complete discussion. At a high level, the Bonferroni correction assigns an "error budget" of δ/|Λ_g| to each possible λ ∈ Λ_g. For large |Λ_g|, this strict tolerance can result in conservative rejections. FST and SGT attempt to exploit structure in Λ_g by ordering and testing λ ∈ Λ_g in ways that are likely to result in more valid rejections.

Published as a conference paper at ICLR 2023

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28, 2015.

Yanqi Zhou, Siavash Ebrahimi, Sercan Ö. Arık, Haonan Yu, Hairong Liu, and Greg Diamos. Resource-efficient neural architect. arXiv preprint arXiv:1806.07912, 2018.



Table D.2: FLOPs speed-up. Accuracy reduction is controlled at level α, while FLOPs speed-up is maximized. Results are averaged over 10 random splits into calibration and test.

ACKNOWLEDGMENTS

We thank the anonymous reviewers and members of the Barzilay and Jaakkola research groups for helpful discussions and feedback. B.L.G. is supported in part by the CHE Data Science fellowship, the Zuckerman-CHE STEM Leadership Program, the Schmidt Futures Israeli Women's Postdoctoral Award and the Technion Viterbi Fellowship. A.F. is supported in part by the NSF Graduate Research Fellowship. This work is also supported in part by the ML for Pharmaceutical Discovery and Synthesis (MLPDS) Consortium and the DARPA Accelerated Molecular Discovery program.


We briefly describe several possible procedures, some of which exploit a-priori known structure in the hypothesis set.

Bonferroni Correction. This is the simplest procedure for counteracting the multiple testing problem, and also the most conservative. The set of rejected hypotheses retrieved by the Bonferroni correction (Bonferroni, 1936) is given by:

Λ_r = {λ ∈ Λ_g : p_cal(λ, α) ≤ δ/|Λ_g|}.

Fixed Sequence Testing. Multiplicity correction can be avoided by relying on a pre-defined ordering of the hypotheses. In FST, the hypotheses are sequentially tested with the same error budget until the first failure to reject. Denoting by H_{λ^(1)}, . . . , H_{λ^(|Λ_g|)} the ordered set of hypotheses, FST yields the following set of rejected hypotheses (Holm, 1979):

Λ_r = {λ^(j) : p_cal(λ^(k), α) ≤ δ, ∀ k ≤ j}.

This procedure is advantageous when there is a natural ordering of the hypotheses from most likely to least likely to be rejected. For example, it can be applied in our problem when n = 1 and the hypotheses are ordered by threshold values from low to high.

Sequential Graphical Testing. SGT (Bretz et al., 2009) can be viewed as an extension of FST in which the relation between the hypotheses is richer than a single sequential path, and is therefore parameterized by a directed graph G. The graph's nodes are null hypotheses, and the edges connecting them specify how the error budget propagates from one node to another. Each node is allocated an initial error budget. Each time a hypothesis is rejected, the procedure reallocates its error budget to the remaining nodes according to the edge weights, and the graph is modified. Several possible graph structures were proposed in Angelopoulos et al. (2021) for the case of a two-dimensional grid of hypotheses. One option is a 'Hamming graph', in which the initial error budget is allocated to the bottom-right node and then propagated outward. Another option is 'Fallback', where the error budget is split between each possibility in the first dimension; an FST is then performed for each fixed value of the first dimension, progressing along the other dimension.
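The FST rule can be sketched in a few lines; the p-values below are illustrative:

```python
# Sketch of Fixed Sequence Testing (FST): given p-values ordered from the
# hypothesis most likely to be rejected to the least, test each at the full
# error budget delta and stop at the first failure.

def fixed_sequence_test(p_values, delta):
    """Return the number of leading hypotheses rejected at level delta."""
    rejected = 0
    for p in p_values:
        if p <= delta:
            rejected += 1
        else:
            break   # first failure ends testing; later hypotheses are kept
    return rejected

ordered_p = [0.001, 0.004, 0.03, 0.2, 0.01]
print(fixed_sequence_test(ordered_p, 0.1))  # 3: stops at p = 0.2
```

Note that the final p-value of 0.01 is never tested, although it is below δ: FST trades this away in exchange for spending the full budget δ on every hypothesis it does reach, which is exactly why a good ordering (such as along the Pareto front) matters.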

