ZERO-LABEL PROMPT SELECTION

Abstract

Natural language prompts have been shown to facilitate cross-task generalization for large language models. However, with no or limited labeled examples, the cross-task performance is highly sensitive to the choice of prompts, while selecting a high-performing prompt is challenging given the scarcity of labels. To address the issue, we propose a Zero-Label Prompt Selection (ZPS) method that selects prompts without any labeled data or gradient update. Specifically, given the candidate human-written prompts for a task, ZPS labels a set of unlabeled data with a prompt ensemble and uses the pseudo-labels for prompt selection. Experiments show that ZPS improves over prior methods by a sizeable margin in zero-label performance. We also extend ZPS to a few-shot setting and show its advantages over strong baselines such as prompt tuning and model tuning.

1. INTRODUCTION

Recently, extensive studies have shown that large language models (LLMs) achieve promising few-shot performance (Brown et al., 2020; Zhao et al., 2021; Schick & Schütze, 2021; Gao et al., 2021), and they even generalize to new tasks without any annotated data (Brown et al., 2020; Wei et al., 2021; Sanh et al., 2021). Unlike conventional fine-tuning methods, which require expensive parameter updates for each downstream task, prompts provide in-context information or task instructions that guide models to perform each task. Manually-written prompts are often used to specify the task and unify the format of inputs. However, the performance of different prompts at evaluation time can vary from near state-of-the-art to random guessing; e.g., using a non-optimal prompt can cause a performance drop of up to 60 points on the CB task (Zhao et al., 2021). Previous work mainly relies on using multiple prompts (Brown et al., 2020; Wei et al., 2021; Sanh et al., 2021) or a prompt ensemble (Zhou et al., 2022) to improve performance and robustness when generalizing to test tasks, but overlooks the fact that using multiple prompts substantially increases computational cost, which hinders the practical deployment of LLMs. These challenges make prompt selection an important problem.

There have been efforts to improve model performance by searching for a better prompt. Jiang et al. (2020) proposed two automatic methods to augment prompts and further explored combining the generated diverse prompts with ensemble methods. Shin et al. (2020) designed a gradient-based search method to find trigger words in a prompt. Gao et al. (2021) developed a way to use a span-corruption pretraining objective for prompt generation. Deng et al. (2022) presented RLPrompt, a prompt search method based on reinforcement learning that relies on a policy network trained with a carefully designed reward function. Prasad et al. (2022) designed an iterative prompt search algorithm that relies on human-defined edit rules to improve few-shot performance. Xu et al. (2022) proposed GPS, a genetic prompt search algorithm that leverages generative language models for prompt augmentation. Nevertheless, the main drawback of these methods is that they all require an additional labeled set to serve as a prompt scoring set or to provide rewards or gradient signals, which is infeasible when no labeled samples are available. Thus, a crucial question arises: Is it possible to select a high-performing prompt without any labeled data or gradient update?

In this paper, we answer this question affirmatively. To tackle the aforementioned problem, we propose ZPS (Zero-Label Prompt Selection), a simple-yet-effective technique for selecting a high-performing prompt in a zero-label setting. As illustrated in Figure 1, given a set of candidate human-written prompts P and an unlabeled dataset X for a task, the ensemble of prompts is used to annotate the unlabeled data, and the resulting pseudo-labels are then used for prompt selection. We also extend the idea of ZPS to a few-shot setting and validate its advantages over strong baselines such as prompt tuning and model tuning. In the few-shot setting, we further explore the role of pseudo-labeled data: it can be used not only for prompt selection but for checkpoint selection as well. By using pseudo-labeled data for checkpoint selection, there is no need to split a validation set from the limited labeled data, so more labeled data can be used for training to boost model performance.

Our contributions are summarized as follows.

• We propose a novel Zero-Label Prompt Selection (ZPS) algorithm that selects high-performing prompts without an extra validation set or any parameter update.

• We show that ZPS can be used in a plug-and-play manner to boost zero-label and few-shot performance. Extensive experiments show that ZPS leads to substantial performance gains in both the zero-label and few-shot settings.
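The selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation: it assumes predictions are integer class ids and uses a simple majority vote over prompt predictions as the ensembling rule (the actual ensemble may aggregate logits instead).

```python
from collections import Counter

def zps_select(predictions):
    """Zero-label prompt selection sketch.

    predictions[p][i] is the class predicted by prompt p for unlabeled
    example i. Returns the index of the selected prompt.
    """
    n_examples = len(predictions[0])
    # (2) Pseudo-labels: majority vote of the prompt ensemble per example.
    pseudo = [Counter(row[i] for row in predictions).most_common(1)[0][0]
              for i in range(n_examples)]
    # (3) Pseudo accuracy: each prompt's agreement with the ensemble labels.
    scores = [sum(p == t for p, t in zip(row, pseudo)) / n_examples
              for row in predictions]
    return max(range(len(predictions)), key=scores.__getitem__)
```

For instance, with three prompts predicting `[1,0,1,1]`, `[1,0,0,1]`, and `[0,0,1,1]` over four examples, the majority-vote pseudo-labels are `[1,0,1,1]`, so the first prompt (pseudo accuracy 1.0) is selected.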

2. RELATED WORK

Pseudo-labeling. Recently, there have been many advances in deep learning with pseudo-labeling. Pseudo-labeling (Lee et al., 2013; Reed et al., 2015; Shi et al., 2018) employs a model to make predictions for unlabeled samples (Yarowsky, 1995b; McClosky et al., 2006). Iscen et al. (2019) showed that pseudo-labels can also be created by label propagation instead of direct network predictions. Shi et al. (2018) incorporated confidence levels for unlabeled samples to discount the influence of uncertain samples. Another line of work is self-training (Scudder, 1965; Yarowsky, 1995a; Riloff, 1996), which trains a model on both labeled and pseudo-labeled data for a few iterations. Modifications such as strong data augmentation (Zoph et al., 2020), learning an equal-or-larger student model (Xie et al., 2020), and injecting additional noise (Xie et al., 2020; He et al., 2020) have been shown to benefit self-training. Another popular technique is ensemble distillation (Hinton et al., 2015), which distills the knowledge of an ensemble into a single model.
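The confidence-based discounting idea can be illustrated with a generic sketch (this is an illustration of the general technique, not the exact scheme of any cited work): examples whose top predicted probability falls below a threshold are simply discarded rather than pseudo-labeled.

```python
def confident_pseudo_labels(probs, threshold=0.9):
    """Keep only high-confidence pseudo-labels (illustrative sketch).

    probs: per-example lists of class probabilities for unlabeled data.
    Returns (example_index, argmax_label) pairs for examples whose top
    probability clears the threshold; uncertain examples are dropped.
    """
    kept = []
    for i, dist in enumerate(probs):
        label = max(range(len(dist)), key=dist.__getitem__)
        if dist[label] >= threshold:
            kept.append((i, label))
    return kept
```

Weighting each pseudo-labeled example by its confidence, instead of hard thresholding, is a common soft variant of the same idea.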



Figure 1: The main pipeline of ZPS. (1) The prompted unlabeled data are fed into the pretrained language model to obtain logits and predictions. (2) The pseudo-labels are obtained from the prompt ensemble. (3) The pseudo-labels are used to calculate the pseudo accuracy of each prompt, and the prompt with the highest pseudo accuracy is selected. Some details, such as prompt filtering, are omitted for brevity and can be found in the text.

