EFFICIENTLY CONTROLLING MULTIPLE RISKS WITH PARETO TESTING

Abstract

Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grows, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process that combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing to this frontier only, to identify configurations that have (i) high utility with respect to our objectives, and (ii) guaranteed risk levels with respect to our constraints, with specifiably high probability. We demonstrate the effectiveness of our approach in reliably accelerating the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes (including the number of layers computed before exiting, the number of attention heads pruned, or the number of text tokens considered) to simultaneously control and optimize various accuracy and cost metrics.

1. INTRODUCTION

Suppose you want to deploy a modern machine learning model in a real-world environment. As a practitioner, you may frequently have to weigh several performance considerations (Jin & Sendhoff, 2008; Ribeiro et al., 2020; Min et al., 2021). For example, how much computational budget can you spend? What accuracy do you require? How large a discrepancy in predictive performance across different groups of end-users, if any, can you tolerate? Often models are equipped with hyper-parameter configurations that provide "knobs" for tuning different aspects of their performance, depending on how such questions are answered. As the number of parameter dimensions and objectives grows, however, choosing the right set of parameters to rigorously control model performance on test data in the intended ways can become prone to error. To address this challenge, the recently proposed Learn Then Test (LTT) framework of Angelopoulos et al. (2021) combines any type of parameterizable predictive model with classic statistical hypothesis testing to provide an algorithm for selecting configurations that lead to provable distribution-free, finite-sample risk control of any user-specified objective. Nevertheless, while theoretically general, a key pair of practical challenges arises when the space of parameters to explore and the set of constraints to satisfy are large. The first is that evaluating all possible configurations can quickly become intractable; the second is that the statistical tests relied upon to guarantee risk control can quickly lose power, and fail to identify configurations that are also useful for the task at hand. In this work, we build upon the results of LTT by introducing Pareto Testing, a simple procedure that provides a computationally and statistically efficient way to identify valid, risk-controlling configurations with (specifiably) high probability which, critically, are also useful with respect to other objectives of interest.
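To make the LTT recipe concrete, the following is a minimal sketch, assuming a loss bounded in [0, 1] and using Hoeffding's inequality to convert each configuration's empirical calibration risk into a p-value, with a Bonferroni correction over all candidates. The function names and the choice of correction here are illustrative, not the exact procedure used in LTT.

```python
import math

def hoeffding_pvalue(emp_risk, n, alpha):
    """P-value for H0: true risk > alpha, from the empirical risk of a
    [0, 1]-bounded loss over n calibration examples (Hoeffding bound)."""
    if emp_risk >= alpha:
        return 1.0
    return math.exp(-2.0 * n * (alpha - emp_risk) ** 2)

def ltt_bonferroni(emp_risks, n, alpha, delta):
    """Indices of configurations certified as risk-controlling at level
    alpha with probability >= 1 - delta, testing every candidate with a
    Bonferroni correction over the full grid of m configurations."""
    m = len(emp_risks)
    return [i for i, r in enumerate(emp_risks)
            if hoeffding_pvalue(r, n, alpha) <= delta / m]
```

With, say, 1,000 calibration points and a target risk alpha = 0.1, a configuration with empirical risk 0.05 passes easily, while one at 0.09 does not; and since the rejection threshold is delta/m, testing a large grid indiscriminately directly sacrifices statistical power, which is the loss-of-power problem described above.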
Our method consists of two stages. In the first stage, we solve an unconstrained, multi-objective optimization problem in order to recover an approximate set of Pareto-optimal configurations, i.e., settings for which no other configuration exists that is uniformly better in all respects. Here we can exploit standard multi-objective optimization methods to efficiently explore and filter large parameter spaces down to their most promising configurations. In the second stage, we perform rigorous sequential testing over the recovered set, which we empirically find to yield tight control of our desired risks, while also giving good performance with respect to our free objectives.¹

Figure 1: A demonstration of our calibration procedure applied to multi-dimensional adaptive computation in a Transformer model (left). Here we have the option to drop tokens from the input, make an "early-exit" prediction after processing a subset of the layers, or only compute a subset of the self-attention heads in each layer in order to do faster inference. Our calibration procedure (right) applies multi-objective optimization to identify a Pareto frontier of configurations with different performance profiles, and then applies statistical testing to efficiently identify a subset of "risk-controlling" configurations with high probability (e.g., bounded error rates).

We apply our approach to adaptive computation in large-scale Transformer models (Vaswani et al., 2017) for natural language processing (NLP); see Figure 1. While larger models generally perform better, they can also be incredibly computationally intensive to run (Bapna et al., 2020; Schwartz et al., 2020; Moosavi et al., 2021). Often, however, not every application, domain, or example requires the same amount of computation to achieve similar performance.
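The two-stage procedure can be sketched as follows. This is a simplified illustration with hypothetical function names (a single controlled risk, exhaustive Pareto filtering, and fixed-sequence testing), not the full algorithm. The key statistical point is that fixed-sequence testing spends the entire error budget delta on each test in order, so no multiplicity correction over the grid size is needed.

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated rows, with all objectives to be minimized
    (e.g., columns = empirical risk and inference cost)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some row is <= p everywhere and < p somewhere.
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            keep.append(i)
    return keep

def fixed_sequence_test(pvalues, delta):
    """Test hypotheses in the given order, each at the full level delta,
    stopping at the first failure; this controls the family-wise error
    rate at delta without any Bonferroni-style correction."""
    valid = []
    for i, p in enumerate(pvalues):
        if p > delta:
            break
        valid.append(i)
    return valid
```

In this sketch, the frontier configurations would be ordered from most to least conservative (most likely to pass first), so that testing stops only once the risk constraint genuinely binds; the power of the procedure then depends on the quality of that ordering rather than on the size of the original parameter grid.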
As such, many techniques have been proposed for accelerating computation, including attention head pruning, token dropping, or early exiting (Graves, 2016; Xin et al., 2020; Hou et al., 2020; Goyal et al., 2020). Still, determining the extent to which to apply different modifications while still preserving good performance can be tricky. Our proposed procedure allows the user to jointly configure multiple model settings subject to multiple statistical guarantees on model performance, such as average and worst-case relative reductions in accuracy (e.g., so that the adaptive model is within 5% of the full model's accuracy), average inference cost (e.g., so that the adaptive model uses less than a certain number of FLOPs on average), or maximum abstention rates in selective prediction settings.

Contribution. The core idea and contribution of our work can be summarized quite plainly:
1. Our framework leverages statistical testing techniques via the LTT framework (Angelopoulos et al., 2021) to identify valid risk-controlling hyper-parameter configurations;
2. To improve efficiency, we introduce Pareto Testing, our main contribution, as a way to efficiently guide the number and order of configurations that we test when searching for valid settings;
3. We demonstrate the scalability and effectiveness of our method in managing trade-offs in multi-dimensional adaptive computation in NLP applications with large-scale Transformer models;
4. On diverse text classification tasks, we empirically achieve tight, simultaneous control of multiple risks while also improving performance on any non-controlled objectives, relative to baselines.
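When a single configuration must satisfy several constraints at once (say, a bounded accuracy drop and a bounded abstention rate), a standard intersection-union construction yields one p-value per configuration by taking the maximum of the per-risk p-values. The sketch below assumes [0, 1]-bounded losses and Hoeffding-based p-values, and is illustrative rather than the paper's exact formulation.

```python
import math

def hoeffding_pvalue(emp_risk, n, alpha):
    """P-value for H0: true risk > alpha ([0, 1]-bounded loss, n samples)."""
    if emp_risk >= alpha:
        return 1.0
    return math.exp(-2.0 * n * (alpha - emp_risk) ** 2)

def joint_pvalue(emp_risks, n, alphas):
    """Intersection-union test: the configuration controls *all* risks only
    if every per-risk null is rejected, so the maximum of the component
    p-values is a valid p-value for the joint null hypothesis."""
    return max(hoeffding_pvalue(r, n, a) for r, a in zip(emp_risks, alphas))
```

Rejecting the joint null at level delta then certifies all constraints simultaneously, which is what allows a single calibrated configuration to carry, e.g., both an accuracy and a cost guarantee at once.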



¹If we fail to find any valid configurations (which may not exist) at the required confidence level, then we abstain.

