-TION WITH TEN FORWARD PASSES

Abstract

We describe PROMPTBOOSTING, a query-efficient procedure for building a text classifier from a neural language model (LM) without access to the LM's parameters, gradients, or hidden representations. This form of "black-box" classifier training has become increasingly important as the cost of training and inference in large-scale LMs has grown. But existing black-box LM classifier learning approaches are themselves computationally inefficient, typically specializing LMs to the target task by searching in a large space of (discrete or continuous) prompts using zeroth-order optimization methods. Instead of directly optimizing in prompt space, PROMPTBOOSTING obtains a small pool of prompts via a gradient-free approach, and then constructs a large pool of weak learners by pairing these prompts with different elements of the LM's output distribution. These weak learners are then ensembled using the ADABOOST algorithm. The entire learning process requires only a small number of forward passes per batch and no backward pass. Experiments show that PROMPTBOOSTING achieves state-of-the-art performance in multiple black-box few-shot classification tasks, and matches or outperforms full fine-tuning in both few-shot and standard learning paradigms, while training 10x faster than existing black-box methods.

1. INTRODUCTION

Prompt-based learning has emerged as an effective method to adapt pretrained language models (LMs) for downstream natural language processing (NLP) tasks. A typical prompt-learning paradigm involves appending a specially-designed sequence, called a prompt, to the input to a pretrained LM, which will thereby be repurposed for a given downstream task. Compared to the standard fine-tuning, prompt-based learning is much more parameter-efficient. Most prompt-based learning methods require searching for the optimal prompt for the downstream task. When gradient information of the pre-trained LM is available, such optimization can easily be performed by standard gradient-based methods (Liu et al., 2021; Li & Liang, 2021; Lester et al., 2021; Zhang et al., 2021; Liu et al., 2022) . However, in many real-world scenarios, the parameters, gradient or hidden representations of the LMs are not accessible, a.k.a. the black-box tuning setting, which makes gradient-based prompt learning very challenging (Sun et al., 2022) . To tackle the challenges, the most common existing black-box solution is to resort to gradient-free optimization techniques to search for the optimal prompt, such as the zeroth-order gradient approximation (Sun et al., 2022; Diao et al., 2022) and reinforcement learning-guided optimization (Deng et al., 2022) . However, these methods would require a large number of queries of the LMs, which, considering the ever-growing size and computation cost of the pre-trained LMs, is highly inefficient and could lead to large approximation errors. In this paper, we propose PROMPTBOOSTING, a novel black-box prompt learning approach which does not rely on searching an optimal prompt, and which can thus drastically improve the computational efficiency over the existing methods. Figure 1 illustrates the pipeline of PROMPTBOOSTING. Specifically, rather than optimizing over the prompts, PROMPTBOOSTING constructs a small pool of prompts via a gradient-free approach. These prompts are sub-optimal because they are not optimized for any downstream tasks. Then, PROMPTBOOSTING creates a large pool of weak learners by pairing each prompt with different elements of the LM's output distribution, which is commonly Step 1: Weak Learner Construction Step 0: Prompt Generation [Input] It's [Mask] . [Input] A [Mask] film. [Input] A truly [Mask] movie. known as the verbalizer. Finally, these weak learners are ensembled using the ADABOOST algorithm, where the optimization in each iteration is performed only over the verbalizer, not the prompt. The entire process only needs to evaluate the LM's output with each of weak prompts, so it only involves a small number of forward passes per batch, and no backward pass.

Prompt Set

We evaluated our method on a number of downstream tasks and LM's. The results show that PROMPTBOOSTING achieves state-of-the-art performance and matches or even outperforms full fine-tuning in both few-shot and standard learning paradigms. Furthermore, PROMPTBOOSTING can run 10x faster than existing black-box prompt-learning approaches, with only ten forward passes per batch. PROMPTBOOSTING can inspire a new black-box prompt learning paradigm.

2. RELATED WORK

Prompt-based learning Prompt-based learning has emerged as a new approach for adapting pretrained LMs for downstream tasks fueled by the success of GPT-3 (Brown et al., 2020) . Since the prompts directly influence the performance of prompt-based learning, recent studies have focused on how to find the best prompts given a specific task. AutoPrompt (Shin et al., 2020) designs a gradientbased discrete optimization method to search for the optimal prompt. LM-BFF (Gao et al., 2021) leverages the pre-trained T5 model (Raffel et al., 2020) to automatically generate prompts and select the best one based on the performance on the validation set. Since verifying the automaticallygenerated prompts is still time-consuming, the PTR (Han et al., 2021) method incorporates logic rules to construct prompts and to encode prior knowledge into prompt-based learning. Another line of work replaces the discrete prompt tokens with continuous embeddings that have their own parameters. P-tuning (Liu et al., 2021) trains a BiLSTM network to output continuous prompt embeddings. Prefix-tuning (Li & Liang, 2021) inserts prompt embeddings to each transformer layer in LMs and optimizes only the prompt embeddings during training. Prompt Tuning (Lester et al., 2021 ) also keeps the LMs frozen but adds prompt embeddings only in the input. In addition to adding prompt embeddings to each layer, P-tuning V2 (Liu et al., 2022) replaces the language model head in LMs with a linear layer for classification and shows that soft prompt tuning scales to mediumsized LMs and hard sequence tagging tasks. Our work adopts the automatically generated discrete prompts for prompt-based learning.

Black-box Tuning

Extremely large LMs such as GPT-3 are only provided as a service in the cloud, resulting inaccessible parameters and gradients of LMs. Furthermore, from the model provider's perspective, sharing hidden representations or gradients of LMs may reveal the vulnerability of the model and lead to security problems (Tramèr et al., 2016) . How to find the optimal prompts in such black-box tuning setting has attracted various explorations. BBT (Sun et al., 2022) employs the CMA evolution strategy, a derivative-free optimization method, to optimize continuous prompt embeddings. However, BBT requires querying the LM tens of thousands of times even in few-shot settings. Furthermore, BBT's use of soft prompts violates the black-box tuning setting because in realistic use one can query the LM service with only textual input instead of prompt embeddings. Other concurrent work (Diao et al., 2022 ) also uses gradient-free techniques to optimize soft prompts. RLPrompt (Deng et al., 2022) is a more realistic black-box tuning method where discrete prompt tokens are optimized through reinforcement learning and the performance on downstream tasks serves as the reward. GrIPS (Prasad et al., 2022) performs phrase-level editing to generate discrete prompts for downstream tasks. Existing black-box tuning methods suffer from poor efficiency and sub-optimal performance. In contrast, our method achieves high efficiency by first generating only a small set of prompts and achieves superior performance by then creating a set of weak learners from these prompts and ensembling them together via ADABOOST (Freund & Schapire, 1997) . Model ensemble. Model ensembling is a commonly used technique in machine learning. Prior to deep learning, Bagging (Breiman, 1996; 2001) and Boosting (Freund & Schapire, 1997; Friedman, 2001) showed the power of model ensembling. One of these methods, ADABOOST (Freund & Schapire, 1997) , sequentially learns a series of weak learners and ensembles them for better generalization. During training, each weak learner is tweaked by leveraging examples that were misclassified by previous classifiers. Since the performance of each individual prompt can be weak, our method adopts ADABOOST as the framework for learning and ensembling multiple prompts. Prompt ensemble. As has been pointed out by prior work (Lester et al., 2021) , ensembling prompts is more efficient than ensembling entire fine-tuned models. Various ensemble strategies have been explored in past work. Uniformly averaging the predictions from different prompts has been used for factual probing (Jiang et al., 2020 ), text generation (Yuan et al., 2021; Schick & Schütze, 2020) , and classification tasks (Schick & Schütze, 2021a; Lester et al., 2021) . Furthermore, some methods adopt a weighted averaging strategy for better performance-the weight of each different prompt can be learned during training (Jiang et al., 2020; Qin & Eisner, 2021) or defined using some heuristics (Schick & Schütze, 2021a; b) . Our method also falls into the prompt ensemble category. The main difference is each prompt-based model is sequentially learned conditioned on the classification errors of prior models.

3. METHODOLOGY

In this section, we will describe the PROMPTBOOSTING algorithm. For notation, we use |A| to denote the size of a finite set A; [A] to denote an index set {1, 2, • • • , A}.

3.1. PROBLEM FORMULATION AND THE PROMPT-LEARNING FRAMEWORK

Consider a text classification downstream task. Denote D tr = i {(x i , y i )} as the training set, where x i denotes the input text sequence and y i denotes the output label. We are given a pretrained language model, denoted as p i = F * (x i ), which, given the input x i , produces a probability distribution over the vocabulary set, V, at a given location. In this paper, only the output distribution at where the input is [mask] is relevant, so p i ∈ R |V|×1 is just a |V|-dimensional vector specifying the output probability at the [mask] location, where |V| denotes the vocabulary size. Our goal is to adapt the LM F * (•) to the downstream task using the downstream training set D tr . We adopt the common prompt-learning framework, where the parameters of F * (•) are frozen (hence we add a superscript * to emphasize this). Instead, the following two mechanisms are added to convert F * (•) into a text classifier for the given downstream tasks. 1. Prompt A prompt is sequence of tokens that is concatenated to the input. Formally, denote the prompt sequence as q and the concatenated input sequence as x i ∥q. Then the LM is modified as F * (x i ∥q).

2.. Verbalizer

To convert the output probability over the vocabulary into that over the classes, a verbalizer is introduced to assign each token into different classes. Formally, denote the number of classes of the downstream class as |Y|, then the verbalizer is an |Y|-by-|V| matrix, denoted as M , where the element in row c, column v represents the assignment weight of the v-token in the vocabulary into class c. Each row of M would sum up to one. The predicted probability of all the classes can then be expressed as M p i . To sum up, after the prompt and verbalizer are applied, the adapted LM becomes M F * (x i ∥q). Therefore, the prompt-tuning process boils down to learning an appropriate verbalizer M and prompt q for the downstream task.

3.2. ALGORITHM OVERVIEW

Conventional black-box prompt learning methods commonly use a pre-set M while performing black-box optimization over q, which results in a large computation cost. On the contrary, PROMPT-BOOSTING randomly chooses from a small number of pre-generated prompts and performs optimization over M instead. Due to the sub-optimality of pre-generated prompts and the limited representation power of M , the resulting classifiers are weak. However, this process is able to quickly generate a large pool of such weak learners, which can then be ensembled into a strong learner using the ADABOOST approach. Since the optimization over M is computationally cheap, the ensembling process is still much more efficient than the conventional black-box methods. More specifically, the ADABOOST algorithm iteratively generates T weak learners, and weaker learner t is optimized under its respective loss function, denoted as L t (q, M ), which is essentially a weighted loss over the training set with larger weights on those that are misclassified by the previous weak learners (More details of the ADABOOST algorithm will be provided in Section 3.5). Then, as shown in Figure 1 , PROMPTBOOSTING consists of the following key steps. Step 0: Generate a pool of prompts, Q = j {q j }, using a gradient-free method. Step 1: Construct T weak learners. For weak learner t, its prompt q t is uniformly randomly drawn from Q; its verbalizer M is determined by solving min M Lt(qt, M ), s.t. Mcv ≥ 0, ∀c ∈ [|Y|], v ∈ [|V|], c∈[|Y|] Mcv = 1, ∀v ∈ [|V|]. Step 2: Ensemble the weak learners according to ADABOOST. Section 3.3 will describe how to solve equation 1. Section 3.4 will describe how the pool of prompts, Q, is generated.

3.3. LEARNING THE VERBALIZER

As discussed, the loss function L t as in equation 1 is essentially a weighted sum of the individual loss over the training dataset D tr , i.e. Lt(qt, M ) = (x i ,y i )∈D tr wtiℓ(xi, yi; qt, M ), where w ti denotes the weight on training data point i for learning weak learner t as determined by ADABOOST; ℓ(x i , y i ; q t , M ) denotes the loss on data point (x i , y i ) with the parameters set to q t and M . Since we focus on classification tasks, ℓ(•) should ideally be the cross-entropy loss. However, the optimization problem in equation 1 is essentially a partition problem, which can easily lead to combinatorial complexity. To derive the tractable solution, we adopt the following strategy. First, solve equation 1 with ℓ(•) set to the ℓ 1 loss, which, though not optimal for the classification task, bears a closed-form solution. Second, further screen the token assignment by maximizing the training set performance. The detailed method is described below. Minimizing the ℓ 1 loss By replacing the ℓ(•) in equation 2 with the ℓ 1 loss, a closed-form solution can be derived, which can establish a basis for the subsequent steps for deriving a good verbalizer. Formally, let h i be the one-hot representation of the class label y i , and let π i = F * (x i ∥q t ) represent the LM output probability with the prompt q t concatenated. Then, with the ℓ 1 loss, equation 2 becomes Lt(qt, M ) = (x i ,y i )∈D tr wti∥hi -M πi∥1 = (x i ,y i )∈D tr wti1 T |hi -M πi| = (x i ,y i )∈D tr wti (-1) h i T (hi -M πi). Here, 1 represents a all-one column vector of dimension |Y|, and (-1) hi represents the elementwise power operation. The last equality is because each element of M π i is within [0, 1] and each element of h i is either 0 or 1, so we can easily remove the absolute sign depending on the actual values of h i . As shown in equation 3, the loss function is linear with respect to M , so the optimization in equation 1 becomes a linear optimization problem with linear constraints, which has closed-form corner solutions. For notational brevity, define a score matrix, S, as S = (x i ,y i )∈D tr wtiπi (-1) h i T , which is the same size as M and is essentially the coefficients multiplied on M in equation 3. Then, we state without detailed derivations that the solution to equation 1 is such that each token is assigned to the class for which it gets the highest score among all the classes, i.e. M cv = 1, if c = arg max c ′ ∈[|Y|] S c ′ v , and 0 otherwise. ( 5) Since the ℓ 1 loss does not generally work well for classification tasks, we empirically find that the verbalizer derived in equation 5 is of limited performance. However, this inspires us that the score matrix, S, is a good measure of how well each token should be selected for a class. In the following step, we will further screen the tokens with the help of the score matrix. Screening the tokens One issue with the verbalizer in equation 5 is that each token has to be assigned to one class, even those tokens that are not well indicative of any classes. Therefore, by removing the non-informative tokens and only retaining the best tokens for each class, we can improve the verbalizer performance. To reduce the computational complexity, we will retain only one token for each class. Specifically, we first identify a candidate set of tokens for each class by choosing the tokens with top-k scores for that class, i.e. the top-k elements in S c: for class c, where subscript c : denotes the c-th row. Then, we evaluate all the possible combinations that include one token from the candidate set for each class (hence k |Y | combinations in total) and choose the combination that achieves the best training accuracy (weighted by {w ti }).

3.4. CONSTRUCTING THE PROMPT SET

To generate the pool of prompts, Q (step 0 in section 3.2), we adopt the optimization-free method proposed by Gao et al. (2021) , which employs the T5 (Raffel et al., 2020) model. Specifically, we first construct a small subset of the training set, denoted as D gen , to induce the prompt generation (D gen is exactly D tr in few-shot setting). Then, for each data point (x i , y i ) ∈ D gen , we construct an input to the T5 model as <input><A><label token><B> (for sentence-pair classification tasks, the input to T5 becomes <input1><A><label token><B><input2>). Here, <A> and <B> are mask tokens in T5 representing spans to be filled in. <input>, <input1> and <input2> represent the input text x i . <label token> is a pre-defined mapping to convert class labels to tokens in V. For example, positive label (y i = 1) in SST-2 (Socher et al., 2013) dataset is mapped to token great while negative label (y i = 0) is mapped to terrible. Given this input, the T5 model fills in the spans for <A> and <B>. The decoding process aims to maximize output probability conditioned on the input over D gen . Then the T5 generated outputs, denoted as <output A> and <output B> will be converted into prompts and concatenated to the training input text, i.e. x i ∥q, in the form of <input><output A>[mask]<output B> (for sentence-pair tasks, the form becomes <input1><output A>[mask]<output B><input2>). As an example, on SST-2 dataset, one of the generated outputs by T5 is <output A> = A truly, <output B> = movie. Then the input sentence "I love it." will be converted to "I love it. A truly [MASK] movie". With a wide beam search width (by default we use 100), we select the top-10 generated prompts according to the log-likelihood to form the prompt pool, Q. All the generated prompts used in our experiments can be found in Table 6 in Appendix C Readers can refer to Gao et al. (2021) for further details. The entire generation process does not involve any optimization over the prompts, and thus is computationally efficient. It is worth noting that the aforementioned approach can be replaced with any other optimization-free prompt generation methods, such as manually creating the prompts, making PROMPTBOOSTING flexible for realistic use. Randomly draw a prompt qt from Q 6: Learn the verbalizer Mt with weight {wti} 7: Set weak learner t to ft(•) = MtF * (•∥qt) 8: Compute weighted error as err (t) = |D tr | i=1 wti1 y i ̸ =f t (x i ) / |D tr | i=1 wti 9: Compute the weight on ft as αt = log 1-err (t) err (t) + log(|Y| -1) 10: Adjust dataset weight w (t+1)i = wti • exp(αt • 1 y i ̸ =f t (x i ) ), ∀i ∈ [|Dtr|] 11: Re-normalize {w (t+1)i }. 12: end for We follow the ADABOOST algorithm to ensemble the weak learners. As discussed, each weak learner minimizes a weighted loss over the training set (equation 2). The final prediction is produced by taking a weighted average over the weak classifiers' output. Further details, including how the weights are computed, are shown in Algorithm 1. It is worth mentioning that we can generate many weak learners at a very low computational cost, because we only need to evaluate the LM's output distribution with each of the pre-generated prompts in Q, beyond which no extra forward pass is needed when learning each weak learner. Since the number of pregenerated prompts is small, typically ten in our implementation, the entire learning process involves no more than ten forward passes per batch in the training set, no matter how many weak learners are generated.

Datasets

We conduct experiments on a wide range of tasks including sentiment analysis (SST-2 (Socher et al., 2013) and MR (Pang & Lee, 2005) ), topic classification (TREC (Voorhees & Tice, 2000) and AG's News (Zhang et al., 2015) ), and natural language inference (SNLI (Bowman et al., 2015) , MNLI-m (Williams et al., 2018) , RTE (Dagan et al., 2005) , and QNLI (Rajpurkar et al., 2016) ). The dataset statistics can be found in Table 4 in Appendix A. Evaluation Setting We mainly evaluate the performance of PROMPTBOOSTING in few-shot settings. This is reasonable especially for black-box model tuning scenarios, where the maximum allowed query times may be limited. We randomly sample k examples per class from the original training set to construct a k-shot training set D tr for model training. Following previous work (Gao et al., 2021; Zhang et al., 2021; Sun et al., 2022) By default we set k = 16 for our main experiments. Also, while previous work splits the training and validation sets in this way and we do so for direct comparison, we also explore integrating the validation set into the training set-in a truly few-shot setting, we should make full use of as many examples as we can, and we show this leads to an improvement in performance. As for evaluation, we use the whole testing set. For SNLI (Bowman et al., 2015) and the datasets from the GLUE benchmark (Wang et al., 2018) , we use the original validation set for evaluation. Backbone Models In the main experiments, we adopt the widely-used RoBERTa-large model (Liu et al., 2019) for evaluation in order to allow for direct comparison with baselines. Baselines We compare PROMPTBOOSTING with fine-tuning and state-of-the-art black-box tuning methods described below. For reference, we also include white-box prompt-based learning methods that are designed for few-shot setting. Implementation details can be found in Appendix A. (1) Fine-tuning is just standard model fine-tuning in a few-shot setting. (2) LM-BFF (Gao et al., 2021 ) is a prompt-based fine-tuning method. In LM-BFF, all input will be transformed using automatically generated prompts. Then the whole model is fine-tuned based on the transformed data. (3) DART (Zhang et al., 2021) replaces the discrete prompts in LM-BFF with trainable prompt embeddings, which can reduce the prompt generation cost. (4) Feature-based MLP uses LMs as feature extractors, which is a simple yet effective way to train LMs in black-box settings. We take the manual prompt from past work (Gao et al., 2021) and query the LMs to get the the masked token prediction p. Then a 3-layer MLP is trained on the extracted features. (5) BBT (Sun et al., 2022) measured by classification accuracy (%). All methods use RoBERTa-large (Liu et al., 2019) as the backbone LM for fair comparison. Two white-box methods are included for reference including LM-BFF (Gao et al., 2021) and DART (Zhang et al., 2021) . Feature-MLP, BBT (Sun et al., 2022) , and RLPrompt (Deng et al., 2022) a black-box tuning method for few-shot learning which employs zeroth-order gradients to optimize the continuous prompts. ( 6) RLPrompt (Deng et al., 2022) models the black-box optimization of discrete prompts as a reinforcement learning problem and adopts Q-learning to find the best prompt.

Implementation Details

The implementations of standard fine-tuning and feature-based MLP are based on the Huggingface transformers library (Wolf et al., 2019) . For all other baselines, we use their official implementations and hyper-parameters. Please refer to more details in Appendix A. For our method, we sequentially train weak classifiers and add them to our ensemble-we stop when validation performance plateaus or when we reach the maximum number of weak classifiers.

Overall Comparison

We first evaluate the effectiveness of PROMPTBOOSTING in a few-shot setting with experiment results in Table 1 . Although there is some variance across datasets, PROMPT-BOOSTING achieves state-of-the-art performance compared to existing black-box tuning methods. We emphasize the effectiveness of model ensembling in PROMPTBOOSTING. Firstly, on the SST-2 and MR datasets, which are sentiment analysis tasks, even individual weak learners in PROMPT-BOOSTING can achieve 100% accuracy on the training set, making the model ensemble inapplicable (note that AdaBoost cannot ensemble classifiers that achieve 100% accuracy). Therefore, we directly train 10 weak learners using 10 prompts on the unweighted training set and then select the weak learner that performs best on the validation set as the final model. Since the advantage of model ensemble is limited on SST-2 and MR dataset, it is not surprising that PROMPTBOOSTING performs slightly worse than BBT and RLPrompt. However, PROMPTBOOSTING is still better than fine-tuning an-MLP, demonstrating the effectiveness of our proposed verbalizer learning method. Secondly, on the other 6 datasets, PROMPTBOOSTING consistently outperforms all baselines. PROMPTBOOSTING also outperforms standard fine-tuning on the 4 NLI tasks. It is worth noting that on the TREC dataset, all of the black-box baselines performs very badly except for PROMPT-BOOSTING, which even achieves a level of accuracy close to that of white-box methods. One poten- tial reason is that the TREC dataset is harder for prompt-based learning. For example, the manual prompt on the TREC dataset achieves only 32% accuracy Gao et al. (2021) . According to our experiments, individual weak learners trained on the unweighted training set using our verbalizer learning method can only achieve 30%-50% accuracy. However, after employing model ensembling, the performance is largely improved, demonstrating the effectiveness of PROMPTBOOSTING. Finally, we incorporate a variant of PROMPTBOOSTING, namely PROMPTBOOSTING-32, which skips the hyper-parameter tuning and directly integrates the validation set into training. The hyperparameter, i.e., the number of weak classifiers, is determined manually according to its value when the validation set is available. By expanding the training set, one can observe a slight improvement on the performance with lower variance. Deployment efficiency Another concern with black-box model tuning is the deployment efficiency. As we have discussed above, directly adopting zeroth-order gradient optimization techniques suffers from the need to query many times, making it less applicable in realistic scenarios. We visualize the deployment efficiency of different methods in Table 2 . AG's News and RTE datasets are adopted due to the average input length (see Table 4 ). The metrics include parameter efficiency (number of trainable parameters and total parameters), wall time of training, and number of forward/backward passes per batch. In terms of trainable parameters, PROMPTBOOSTING optimizes only less than 1k parameters (|Y| * 200) and does not introduce any extra parameters. In contrast, RL-Prompt uses another network, DistilGPT2 (Sanh et al., 2019) , in addition to the backbone RoBERTa model and consequently increases the training cost. In terms of wall time, PROMPTBOOSTING improves the efficiency over existing black-box tuning baselines (BBT and RLPrompt) by more than 10 times. The query time is also significantly lower. Only 10 forward passes per batch of training data are required during the training of PROMPTBOOSTING. By contrast, our baselines require thousands of forward passes, which makes them hard to use in realistic scenarios.

Effect of training data size

We also study the performance of PROMPTBOOSTING as the size of the training set increases (see Figure 2 ). Note that we still fix k = 16 for the validation set regardless of the training set size. Results on AG's News, TREC, QNLI, and RTE dataset are shown in Figure B in Appendix B. The conclusions are in three dimensions. Firstly, on the SST-2 and MR datasets, PROMPTBOOSTING consistently outperforms fine-tuning with lower variance, demonstrating the effectiveness of our method. Secondly, on the AG News and TREC datasets, PROMPTBOOSTING performs worse than fine-tuning. A similar phenomenon also exists in past work (Gao et al., 2021) , where even a white-box prompt-based few-shot learning method can achieve performance that is at most only comparable with fine-tuning. However, we remark that our method still maintains large advantages compared to all black-box baseline methods and achieves highly usable performance. Finally, as the amount of training data increases, the performance of fine-tuning improves and gradually outperforms our method on the four NLI datasets. This finding is possibly due to the fact that pre-trained LMs before fine-tuning are not good at tasks involving sentence pairs.

Refinement of prompts

The performance of the weak learner in PROMPTBOOSTING directly depends on the prompt. As has been shown in previous work, different prompts have significant influence on the performance of prompt-based methods (Shin et al., 2020; Gao et al., 2021) . However, in PROMPTBOOSTING, the prompts are fixed and will not be optimized during training. Therefore, we consider a simple yet effective way to improve the performance through prompt refinement. Specifically, because we automatically generate 100 prompts for each dataset but only use 10 of them, we may select top-10 prompts following some heuristics to improve the quality of prompts. Before training, we first evaluate the performance on the validation set by training a weak classifier using the method in Section 3.3 on the unweighted few-shot training set. Then we construct the prompt pool by selecting the top-10 prompts according to the accuracy of the corresponding weak learner on validation set. Please note that the few-shot setting makes the refinement process very efficient. Later on, PROMPTBOOSTING is trained using the refined prompts. We mainly evaluate the effectiveness of the prompt refinement on SNLI, MNLI, and QNLI dataset where the gap between PROMPTBOOSTING and standard fine-tuning is relatively large with the increase of training data. Experiment results can be found in Figure 2 . One can observe consistent improvements on few-shot performance across three NLI tasks, especially on QNLI dataset where the performance of PROMPTBOOSTING was far from satisfactory without prompt refinement. Overall, the prompt refinement leads to a trade-off between training cost and model performance. Effect of the number of prompts In our main experiments, we use 10 prompts by default. Intuitively, a large prompt pool increase the diversity of weak classifiers which could improve the performance. However, the training/inference cost will also increase if more prompts are included for model training. We empirically study the relationship between the number of prompts and the model performance in Table 3 . In general, more prompts benefit the performance for most datasets (except QNLI). We highlight the effectiveness of multiple prompts on AG's News and TREC dataset, on which the performance becomes better and more stable. As we have discussed in the few-shot experiments in Table 1 , individual prompt performs very bad on TREC dataset. This is also proved by PROMPTBOOSTING-1 that only achieves 41.3% accuracy. However, by using our prompt ensemble framework, the performance can be boosted to 84.6% when 10 prompts are provided. Finally, the performance improvement is relatively small when the number of prompts increase from 10 to 20, implying that 10 prompts should be good enough for PROMPTBOOSTING. Full data training Due to its efficiency, PROMPTBOOSTING can also generalize to full data training instead of just the few-shot setting. We compare with fine-tuning on the entire training dataset for SST, MR, TREC, and RTE. Experimental results can be found in Table 5 in the Appendix B.

5. CONCLUSION

In this paper, we propose PROMPTBOOSTING, an effective black-box model tuning framework. Without access to the parameters and gradients of pre-trained LMs, PROMPTBOOSTING can adapt LMs for various downstream tasks. The efficient weak learner construction method, together with the ADABOOST model ensemble algorithm, makes PROMPTBOOSTING achieve state-of-the-art performance in black-box tuning setting with at least 10x run-time efficiency. For future directions, we will explore how to generalize PROMPTBOOSTING beyond the classification tasks to generation tasks. Also, we will study how to combine the prompt ensemble idea in PROMPTBOOSTING with gradient-based optimization and further improve the performance of existing prompt-based learning methods. Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263-27277, 2021. Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. Differentiable prompt makes pre-trained language models better few-shot learners. In International Conference on Learning Representations, 2021. Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015. Under review as a conference paper at ICLR 2023 

A IMPLEMENTATION DETAILS

Dataset Statistics The dataset statistics can be found in Table 4 . For fair comparison, the fewshot training/validation/testing split generation is strictly following the implementation of Gao et al. (2021) . Training of baselines For standard fine-tuning, we adopt the Huggingface transformers library (Wolf et al., 2019) to load RoBERTa-large backbone model and use its Trainer for finetuning. The learning rate is set to 1e-5. We use AdamW optimizer as the optimizer and the learning rate linearly decays to 0. The training batch size is set to 16 and total training epochs is 100. For Feature-MLP method, we use a three-layer MLP with hidden dimension 100. The learning rate is set to 1e-3 without learning rate decay. We also train the MLP for 100 epochs. For other baselines, we use their the official implementation with default hyper-parameters including LM-BFF (Gao et al., 2021) , DART (Zhang et al., 2021) , BBT (Sun et al., 2022) , and RLPrompt (Deng et al., 2022) . For RLPrompt, because of its low efficiency, we set its training epochs to 1000 instead of the 12000 used in their paper. This is reasonable since it takes nearly 2 hours for RLPrompt to finish 1000 epochs of optimization.

B ADDITIONAL EXPERIMENTS

Performance on full dataset The high efficiency of PROMPTBOOSTING makes it possible to generalize to medium-sized datasets. We evaluate the performance of PROMPTBOOSTING on SST-2, MR, TREC, and RTE datasets. We sample 10% of the original training set to construct the validation set and use the original validation set for testing if the labeled test set is unavailable. The experiment results can be found in Table 5 . PROMPTBOOSTING achieves comparable performance with standard fine-tuning on SST-2 and MR datasets, which is impressive given the fact that PROMPT-BOOSTING has no access to the parameters and gradients of the LM. For TREC dataset, standard fine-tuning outperforms PROMPTBOOSTING, but we still remark that the performance is still highly usable in black-box setting. Finally, the gap between PROMPTBOOSTING and fine-tuning is relatively large on RTE dataset, which is consistent with our previous discovery that it seems pre-trained LMs are not good at sentence pair classification tasks before fine-tuning. 



Figure 1: Overview of PROMPTBOOSTING.

ENSEMBLING THE WEAK LEARNERS Algorithm 1 Model Ensemble in PROMPTBOOSTING 1: Input: prompt set Q = j {qj}, LM F * (•), Dtr, 2: Output: weak learners t {ft(•)} and their weights t {αt}. 3: Set initial data weight to w1i = 1/|Dtr|, ∀i ∈ [|Dtr|] 4: for Iteration t = 1, . . . , T do 5:

, we also construct the validation set D val by randomly sampling another k examples per class from the original training set (i.e., |D tr | = |D val |).

Figure 2: Model performance as a function of training set size on different datasets. For NLI tasks (SNLI and MNLI), we also include prompt refinement for better performance.

Effect of training data size For AGNews, TREC, QNLI, and RTE datasets, we shown the performance of PROMPTBOOSTING as the size of the training set increases in Figure B.

is Performance of proposed PROMPTBOOSTING and baseline methods in few-shot setting (k = 16)

are the main black-box baselines. PROMPTBOOSTING-32 combines both training and validation set for training. Mean accuracy (and standard deviation) is reported over 5 different splits.

Deployment efficiency of proposed PROMPTBOOSTING and baseline methods in few-shot setting (k = 16). With all methods use RoBERTa-large (335M parameters) as the backbone LM, some baselines introduce additional parameters, leading to the slight variation in total parameters. Wall time is reported to measure the training time efficiency. Query efficiency is evaluated by #Forward and #Backward, which refer to the number of forward/backward passes per batch during training respectively.

Performance of PROMPTBOOSTING with different numbers of prompts in few-shot setting (k = 16).PROMPTBOOSTING-d means top-d prompts (sorted according to the beam search score) are used for model training. Mean accuracy (and standard deviation) is reported over 5 different splits.

The dataset statistics. |Y| is the number of classes, Avg.#W is the average number of words in the input, and #Train/#Test refer the number of examples in training/testing dataset.

Performance of full data training

REPRODUCIBILITY STATEMENT

The authors have made great efforts to ensure the reproducibility of the experiment results in the paper. Firstly, the experiment settings, evaluation metrics, benchmarks, etc. are explained in detail in Section 4.1. In summary, eight public datasets with clear references are used for evaluation in few-shot setting. Secondly, the implementation details and hyper-parameters of our method and baselines are clearly presented in Section B and Appendix A. We also include all prompts that are used for our method in Appendix C. Thirdly, all experiments are based on 5 runs with different random seeds to improve the reliability. Finally, the pre-released code for the proposed method is also included in the supplemental material. 

C GENERATED PROMPTS

In this section, we visualize the prompts we used in our experiments for each dataset in Table 6 .Regardless of different few-shot training/validation splits, we use the same 10 prompts for model training.

