SELF-GUIDED NOISE-FREE DATA GENERATION FOR EFFICIENT ZERO-SHOT LEARNING

Abstract

There is rising interest in further exploring the zero-shot learning potential of large pre-trained language models (PLMs). A new paradigm called data-generation-based zero-shot learning has achieved impressive success. In this paradigm, the data synthesized by the PLM acts as the carrier of knowledge and is used to train a task-specific model with orders of magnitude fewer parameters than the PLM, achieving both higher performance and efficiency than prompt-based zero-shot learning methods on PLMs. The main hurdle of this approach is that the data synthesized by the PLM usually contains a significant portion of low-quality samples. Fitting on such data greatly hampers the performance of the task-specific model, making it unreliable for deployment. Previous methods remedy this issue mainly by filtering synthetic data using heuristic metrics (e.g., output confidence) or by refining the data with the help of a human expert, which comes with excessive manual tuning or expensive costs. In this paper, we propose SUNGEN, a novel noise-robust re-weighting framework that automatically constructs high-quality data for zero-shot classification problems. Our framework learns sample weights indicating data quality without requiring any human annotation. We theoretically and empirically verify the ability of our method to construct good-quality synthetic datasets. Notably, SUNGEN-LSTM yields a 9.8% relative improvement in average accuracy over the baseline across eight established text classification tasks.

1. INTRODUCTION

Owing to the superior generative capacity of large-scale pre-trained language models (PLMs), there has been an emerging trend of using these powerful models (e.g., GPT) to generate training data for downstream tasks (Anaby-Tavor et al., 2020; Puri et al., 2020; Kumar et al., 2020; Lee et al., 2021, inter alia). Among them, a new line of generation-based zero-shot learning using the unfinetuned PLM pushes the envelope further (Schick & Schütze, 2021; Ye et al., 2022a; Meng et al., 2022), featuring totally annotation-free training for downstream tasks. Ye et al. (2022a) (ZEROGEN) further boost efficiency by using the generated data to train tiny task models (TAMs), which have orders of magnitude fewer parameters than the PLM. Specifically, they first design prompts incorporating the task description and label information, then use them to guide data generation from the PLM. Subsequently, the synthesized dataset is used to train the tiny task-specific models. Compared with classic prompt-based zero-shot learning on PLMs, this new paradigm enjoys two favorable properties: (1) since the task model has orders of magnitude fewer parameters than the PLM, it demonstrates much lower inference latency; (2) with the large amount of PLM-generated training data, the task model often shows better performance than its prompt-based zero-shot PLM counterparts. In this paradigm, the amount and quality of the generated data are crucial factors for the task model's performance. Unfortunately, despite the unlimited training data that one can generate in theory, data quality is not always guaranteed.
Our experimental observations across many downstream tasks verify this issue: in ZEROGEN, after a few training epochs on the PLM-generated dataset, although the training accuracy steadily improves, the actual test accuracy of the model starts declining rapidly (e.g., IMDb in Figure 1), a clear indication of the model overfitting to low-quality (noisy) data (Arpit et al., 2017). More specifically, we identify two major categories of noisy samples in the synthetic dataset: samples with corrupted labels and task-irrelevant samples (Table 6 in Appendix). Without any task-related fine-tuning, it is challenging for a PLM to follow a user's instruction (a task-specific prompt including label information) and generate accurate samples in the target domain (Ouyang et al., 2022). To alleviate the data quality issue, recent work adopts human-in-the-loop labeling to correct corrupted labels or revise examples (Wang et al., 2021a; Liu et al., 2022). However, such methods introduce considerable costs and may be unrealistic. To avoid human intervention, the classic approach to eliminating the effect of noisy data is to re-weight the samples. The core idea is to design a weighting function w such that correct samples are associated with larger weights and noisy ones with smaller weights. Compared with heuristic designs of w (e.g., based on output confidence or loss value) (Liu & Tao, 2015; Wang et al., 2021b), which require task-specific knowledge and excessive manual tuning, adaptive methods that learn the sample weights in an end-to-end manner demonstrate better performance in practice (Ren et al., 2018; Shu et al., 2019; Zheng et al., 2021). Those methods typically formulate the learning of sample weights as a bi-level optimization problem, with a clean validation set in the outer loop to guide the learning of w.
Despite their remarkable success, the dependence on a clean validation set becomes a major limitation, which is especially impractical in the zero-shot setting. Our solution comes from rethinking the choice of the outer objective in the bi-level framework: can we design an objective such that the sample weights can be optimized with access to only the noisy synthetic data? To this end, we resort to a family of noise-robust loss functions (ℓ_robust) (Ghosh et al., 2017; Zhang & Sabuncu, 2018). These functions were adopted by previous work to train neural networks under label noise due to their theoretically noise-tolerant property (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Wang et al., 2019). However, from the optimization point of view, such loss functions suffer from instability and difficulty when training neural networks (Zhang & Sabuncu, 2018), which limits their effectiveness. Remarkably, our approach leverages the noise-tolerant property of these losses while avoiding their pathology. We propose a novel bi-level re-weighting framework, SUNGEN: in the inner loop, we train the task model using a weighted training loss based on the current sample weights; in the outer loop, the noise-robust loss is adopted to guide the learning of the sample weights. The two procedures are performed alternately to generate a set of weights indicating the importance of samples. Notably, our method focuses on enhancing the quality of the generated data; improving the generator (e.g., modifying PLM parameters, prompt engineering) is an orthogonal direction and can be applied jointly with our method. Our main contributions are threefold. First, we propose a novel end-to-end framework to construct high-quality, noise-free synthetic datasets without the aid of any human annotation (§3). Second, we offer theoretical justification (§4) and empirical verification (§5.2) of SUNGEN's ability to reliably recover a noise-free dataset with synthetic data only.
Third, we conduct experiments on eight text classification datasets and show our method outperforms the current baseline by large margins ( §5.2).

2.1. PROMPT-BASED ZERO-SHOT LEARNING

We first introduce prompt-based zero-shot prediction (named PROMPTING). Given a manually designed prompt T (•) and a query example x i ∈ X, PROMPTING constructs a sentence T (x i ) (e.g., "The movie review in <y i > sentiment is: <x i >"). The PLM P is expected to model the probability distribution over a set of label words y i ∈ Y (e.g., "positive", "negative") and select the class with the highest probability for x i . Even though PROMPTING has achieved remarkable success in zero-shot learning (Radford et al., 2019; Brown et al., 2020), it is difficult for PROMPTING to fully leverage the task-specific knowledge in PLMs. Besides, PROMPTING still needs to conduct inference on a cumbersome PLM.
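As a minimal, self-contained sketch of this label-scoring procedure, the snippet below fills the template with each candidate label word and picks the argmax. The scoring function is a hypothetical stand-in for a real PLM's sequence log-probability (e.g., from GPT-2); its cue-matching logic is purely illustrative.

```python
def prompt_zero_shot(x, label_words, template, plm_log_prob):
    """PROMPTING sketch: fill the template with each candidate label word,
    score the filled-in sentence with the PLM, and return the argmax label."""
    scores = {y: plm_log_prob(template.format(label=y, text=x)) for y in label_words}
    return max(scores, key=scores.get)

def toy_log_prob(sentence):
    # Hypothetical stand-in for a PLM scorer: real code would return the
    # PLM's log-probability of `sentence`; here we use a crude lexical cue.
    return -1.0 if ("positive" in sentence and "great" in sentence) else -5.0

template = "The movie review in {label} sentiment is: {text}"
pred = prompt_zero_shot("A great film.", ["positive", "negative"], template, toy_log_prob)
# pred == "positive"
```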

2.2. EFFICIENT ZERO-SHOT LEARNING VIA DATA GENERATION

A new line of research (Ye et al., 2022a; Meng et al., 2022; Ye et al., 2022b) endeavors to make zero-shot learning more practical and efficient. Among these, the generative efficient zero-shot learning paradigm proposed by ZEROGEN (Ye et al., 2022a) is as follows.

Synthetic Data Generation. Given a task, the paradigm first generates a synthetic dataset S syn = (X syn , Y syn ) with the help of a large-scale PLM P and task-related prompts. The idea is to use the model P to generate the input x syn based on a pseudo label y syn . For example, in a text classification task, a class label y syn is uniformly sampled: y syn ∼ U(y 1 , y 2 , . . . , y K ), where K is the number of classes. The pseudo label y syn is transformed into a label-descriptive prompt T (y syn ), which is used to generate x syn : x syn ∼ P(•|T (y syn )). The generated x syn and pseudo label y syn are paired to construct a pseudo training dataset S syn .

Efficient Training and Inference. To achieve efficient training and inference, the paradigm then trains a tiny task model (TAM) (e.g., a 1-layer Bi-LSTM) on the synthetic dataset S syn . TAMs typically have far fewer parameters, making them easy to train and more efficient at inference. Although ZEROGEN achieved promising results by training a TAM on S syn , we find that the model's performance rapidly declines after several training epochs, indicating overfitting to noisy samples (Figure 1). A classic approach to improving the quality of a dataset containing low-quality data is to re-weight the samples using some function w. Intuitively, if we assign large weights to correct samples and small weights to noisy ones, the negative influence of noisy samples during training can be reduced. However, heuristic designs of w (e.g., based on output confidence or loss value) suffer from unstable performance and require task-specific knowledge (Shu et al., 2019). Therefore, our paper seeks to optimize w automatically.
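The generation step above can be sketched as follows; the template and the stand-in generator are hypothetical placeholders for prompting a real PLM such as GPT2-XL.

```python
import random

def generate_synthetic_dataset(n, labels, template, plm_generate, seed=0):
    """ZEROGEN-style generation sketch: sample a pseudo label uniformly,
    build a label-descriptive prompt T(y), and let the PLM complete x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice(labels)              # y_syn ~ U(y_1, ..., y_K)
        prompt = template.format(label=y)   # T(y_syn)
        x = plm_generate(prompt)            # x_syn ~ P(. | T(y_syn))
        data.append((x, y))
    return data

# Hypothetical stand-in for sampling a continuation from a real PLM.
toy_plm = lambda prompt: prompt + " <generated text>"
template = "The movie review in {label} sentiment is:"
dataset = generate_synthetic_dataset(4, ["positive", "negative"], template, toy_plm)
```

Each pair (x, y) then serves as a pseudo training example for the TAM.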

3. METHOD

To automatically learn sample weights, bi-level optimization approaches (Ren et al., 2018; Shu et al., 2019) have proven effective. However, such methods are impractical in our zero-shot scenario, since they depend on a clean validation set to guide the optimization of the sample weights. To circumvent the absence of human-labeled data, we propose a Self-gUided Noise-free data GENeration framework, named SUNGEN (Figure 2). Concretely, we propose a sample re-weighting framework via bi-level optimization using a noise-robust loss (Ghosh et al., 2017; Wang et al., 2019; Zhang & Sabuncu, 2018) as the outer objective. Due to the appealing properties of such loss functions, our method is able to learn meaningful sample weights with only a synthetic (noisy) validation dataset.

Notations. As elaborated in Section 2.2, we first generate a synthetic dataset using a left-to-right PLM and task-related prompts for the given task. Let S_syn, S_clean ⊂ X × Y = R^d × {1, ..., K} denote the distributions of synthetic (noisy) and clean (gold) data, respectively. Here d is the dimension of the input space and K is the number of classes. We draw a training dataset S^t_syn := {(x, y)^(i)}_{i=1}^N and a validation dataset S^v_syn := {(x, y)^(j)}_{j=1}^M from S_syn. Denote by f(x, θ) the classifier (TAM) with parameters θ. w ∈ W := {w(·, ·) : X × Y → [0, 1]} is a re-weighting function that assigns a weight to each sample. The bold w is a sample-weight vector of length N indicating the training samples' importance, with per-sample weight w_i := w(x_i, y_i).

Overall Framework. Without a clean validation set, our proposed bi-level optimization framework SUNGEN (as shown in Figure 2) can be outlined as follows:

w* ∈ argmin_w L_robust(θ̂(w), S^v_syn) = argmin_w (1/M) Σ_{(x,y) ∈ S^v_syn} ℓ_robust(f(x, θ̂(w)), y)   (1)

θ̂(w) = argmin_θ (1/N) Σ_{(x_i, y_i) ∈ S^t_syn} w_i ℓ_ce(f(x_i, θ), y_i)   (2)

where w* denotes the optimal sample weights, obtained from the outer loop (Eqn. 1); θ̂(w) denotes the classifier's parameters after weighted training with w, obtained from the inner loop (Eqn. 2); ℓ_robust denotes the noise-robust loss calculated on the validation set; and ℓ_ce denotes the cross-entropy (CE) loss calculated on the training set. The whole process is: (1) in the inner loop (Eqn. 2), we fix w and optimize the classifier f(x, θ) using the weighted loss L_ce over the synthetic training set S^t_syn, deriving the trained classifier θ̂(w); (2) in the outer loop (Eqn. 1), we calculate the noise-robust loss L_robust on the synthetic validation set S^v_syn at the optimized θ̂(w). The outer loss is minimized to guide the optimization of w, and the outer gradient ∇_w L_robust is calculated via truncated backpropagation (Appendix H). The two procedures are performed alternately for T iterations. Notably, our framework removes the need for a clean validation set, which is the main obstacle for previous bi-level re-weighting approaches in the zero-shot scenario (Ren et al., 2018; Shu et al., 2019). The magic of this appealing feature lies in the choice of the outer objective L_robust in Eqn. (1). Recall that the goal of sample re-weighting is to find w* such that the model f(x; θ), trained with the weighted loss over the synthetic training set, performs well on clean data (from the same distribution as the test data).

Algorithm 1 SUNGEN
Input: synthetic training set S^t_syn, synthetic validation set S^v_syn, number of iterations T, outer learning rate λ.
1: Initialize sample weights w_0.
2: for t = 0, ..., T − 1 do
3:   Obtain θ̂(w_t) by weighted training on S^t_syn (Eqn. 2).
4:   Evaluate θ̂(w_t) on S^v_syn, then obtain the meta gradient ∇_w L_robust via Eqn. (11).
5:   Update w using ∇_w L_robust by gradient descent: w_{t+1} ← w_t − λ∇_w L_robust.
6: end for
Output: Optimized sample weights w*.

In Sec. 4, under the condition that the majority of the data are correctly labelled (Ghosh et al., 2017; Wang et al., 2019), we theoretically show that, using L_robust as the outer objective, our method can find a set of sample weights w* with just the synthetic validation set, such that w* maximizes the model performance on clean data. The whole procedure is illustrated in Algorithm 1.

Noise-robust Loss Functions. In the noisy-label learning literature, there is a family of loss functions that possess the following property:

Σ_{j=1}^K ℓ_robust(f(x, θ), j) = C, ∀θ, x,   (3)

where f(·, θ) denotes a classifier, x is the input, K is the number of classes, and C is a constant. Previous work (Wang et al., 2019; Zhang & Sabuncu, 2018; Ghosh et al., 2017) has shown that these loss functions have a consistent global minimum under label noise. More formally, when the majority of the training samples are correctly labelled, the global minimizer θ* of ℓ_robust is the same regardless of whether the training data is clean or noisy. This noise-robust property enables these losses to guide the optimization of the sample weights given only a noisy validation dataset. A detailed proof is given in Sec. 4. In particular, we consider the reversed cross-entropy loss ℓ_rce, which takes the following form for a sample (x, y):

ℓ_rce(f(x, θ), y) = − Σ_{k=1}^K f_k(x, θ) log q(k|x),

where f_k(x, θ) = e^{z_k} / Σ_{i=1}^K e^{z_i} denotes the predicted probability for each label k ∈ {1, ..., K}, and z_i are the logits. We denote the ground-truth distribution over labels by q(k|x), with Σ_{k=1}^K q(k|x) = 1. Given the ground-truth label y, q(y|x) = 1 and q(k|x) = 0 for all k ≠ y; log(0) is approximated by a constant A. One can easily check that ℓ_rce satisfies Property (3), with C = −(K − 1)A in this case. Remark.
Even though ℓ_robust is theoretically noise-tolerant, using it to train the network parameters θ has been shown to cause optimization difficulty and to hamper the network's performance (Wang et al., 2019; Zhang & Sabuncu, 2018). Remarkably, adopting ℓ_robust as the objective in the outer loop of Eqn. (1) implicitly overcomes these side effects: ℓ_robust is now used to optimize the sample weights w, which lie in a much smaller search space with a simpler structure than θ, and are thus easier to optimize. This "decouples" noise removal from network training, and thereby avoids hampering the TAM's performance.

Clean Subset Sampling. Our framework derives a set of continuous weights w, which encode data quality and can well separate the noisy samples from the clean ones, as shown in Figure 3. With those weights, we can sample subsets with arbitrary budgets and use them to train the task models with the unweighted CE loss. More specifically, given a budget D, we first normalize w_i to w'_i (i.e., so that the weights form a distribution over the training samples) and then sample D examples according to the normalized weights.
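The subset-sampling step can be sketched as follows. Since the exact normalization scheme is not fully specified above, the proportional sampling without replacement below is our assumption, for illustration only.

```python
import numpy as np

def sample_clean_subset(weights, budget, seed=0):
    """Normalize the learned sample weights into a distribution and draw
    `budget` examples without replacement, favoring high-weight samples."""
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()                      # assumed normalization w'_i
    rng = np.random.default_rng(seed)
    return rng.choice(len(w), size=budget, replace=False, p=p)

weights = [0.9, 0.01, 0.8, 0.02, 0.95]   # learned weights: small ~ likely noisy
subset_idx = sample_clean_subset(weights, budget=3)
```

The resulting subset is then used to train the TAM with the plain (unweighted) CE loss.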

4. THEORETICAL ANALYSIS

Though we have no clean data as a validation set, our method still enjoys favorable theoretical properties. Recall that S_clean and S_syn denote the clean and synthetic (noisy) distributions, respectively. We further denote L(θ, S) = E_{(x,y)∼S}[ℓ(f(x; θ), y)] for a data distribution S. Let the optimal network parameters obtained with the CE loss over the clean distribution be θ* := argmin_θ L_ce(θ, S_clean). We assume θ* is unique, or we focus on the θ* with minimum norm.

Property 1. We call a loss function ℓ a robust loss if, under mild conditions as in (Ghosh et al., 2017), its minimizer on the noisy distribution coincides with the one on the clean distribution, i.e.,

argmin_θ L_robust(θ, S_clean) = argmin_θ L_robust(θ, S_syn).

Previous work on noise-robust loss functions (Ghosh et al., 2017; Wang et al., 2019; Zhang & Sabuncu, 2018; Xu et al., 2019) has shown that losses satisfying Eqn. (3) have this noise-tolerant property. We give the detailed assumptions and complete proof in Appendix A.2.

Assumption 1. Let P_clean and P_syn denote the probability density functions of S_clean and S_syn, respectively. There exists a weighting function w* such that P_clean(x, y) = w*(x, y) P_syn(x, y).

Assumption 1 is reasonable because the synthetic data generated by the PLM has wide coverage. Therefore, with a proper weighting function w, we may recover the clean distribution by re-weighting the synthetic data.

Assumption 2. The optimal θ* uniquely minimizes L_robust(θ, S_clean), i.e., L_robust(θ*, S_clean) < L_robust(θ, S_clean) for all θ ≠ θ*.

Assumption 2 is natural since robust losses were originally designed to train the model; minimizing L_robust(θ, S_clean) is thus expected to achieve promising performance, as justified by Ghosh et al. (2017) (though the optimization difficulty mentioned in the Remark of Section 3 poses challenges for model training).
We also experimentally show that when training a model with L_ce(θ, S^t_clean), L_robust(θ, S^t_clean) also decreases and reaches a plateau at a similar point (Appendix A.1). This indicates that L_ce(θ, S_clean) and L_robust(θ, S_clean) have close optimal solutions. Note that even if θ and θ* differ by a small quantity, i.e., L_robust(θ*, S_clean) < L_robust(θ, S_clean) + ε holds for a small ε, the following proof holds with minor modifications.

Theorem 1. If Assumption 1 holds, there exists a w* such that θ̂(w*) = θ*. Further, with Assumption 2 and Property 1, our method uniquely returns w* and the resulting θ* with only the synthetic (noisy) data S_syn.

Given that Assumption 2 holds, L_ce(θ, S_clean) and L_robust(θ, S_clean) have a consistent optimal solution θ*. Furthermore, Property 1 of L_robust indicates that θ* can be found with just the noisy synthetic data S_syn. Therefore, we can optimize L_robust(θ, S_syn) to find θ*. Finally, since θ* can be parameterized by w* (Theorem 1), which is achieved by the inner loop (Eqn. (2)), we can optimize L_robust(θ̂(w), S_syn) over w in the outer loop (Eqn. (1)) to find the optimal w* as well as θ*. See Appendix A.3 for the detailed proof. The above analysis theoretically shows that SUNGEN is able to learn the optimal sample weights w* and the resulting optimal model parameters θ* with just synthetic data. Notably, Theorem 1 is based on population losses with infinitely many samples. We further characterize the generalization behavior with finitely many samples:

Theorem 2 (Finite-Sample Generalization). Suppose we have access to synthetic datasets S^v_syn and S^t_syn, both with N samples. Let θ̂*(w) be the deterministic mapping from w to θ defined by the inner loop of Eqn. (2) given S^t_syn.
Assume the output of the loss function ℓ_robust is upper bounded by M, the cardinality of W is |W|, the outer loop of Eqn. (1) is solved ϵ-approximately, and the solution ŵ satisfies

L_robust(θ̂*(ŵ), S^v_syn) ≤ min_w L_robust(θ̂*(w), S^v_syn) + ϵ.

Then, with probability at least 1 − δ,

L_robust(θ̂*(ŵ), S_syn) ≤ min_w L_robust(θ̂*(w), S_syn) + ϵ + κ M √(2 ln(|W|/δ) / N),   (4)

where κ is an absolute constant. Refer to Appendix A.4 for the full proof. Theorem 2 characterizes the generalization ability of SUNGEN: if we obtain an approximate solution of the bilevel problem in Eqns. (1)-(2) from finite samples, Eqn. (4) shows that this solution has test performance close to that of the oracle model.
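The symmetry property in Eqn. (3), and the constant C = −(K − 1)A that it yields for ℓ_rce, can be checked numerically. Below is a small self-contained sketch; the value A = −4 standing in for log(0) is an illustrative choice.

```python
import numpy as np

A = -4.0  # illustrative approximation of log(0) in the RCE definition

def rce_loss(logits, label):
    """Reversed cross-entropy: -sum_k f_k(x) * log q(k|x), with
    log q(y|x) = log 1 = 0 and log q(k|x) = log 0 ~ A for k != y."""
    f = np.exp(logits - logits.max())
    f = f / f.sum()                      # softmax probabilities f_k
    log_q = np.full_like(f, A)
    log_q[label] = 0.0
    return -(f * log_q).sum()

K = 5
logits = np.random.default_rng(0).normal(size=K)  # arbitrary classifier output
total = sum(rce_loss(logits, j) for j in range(K))
# Symmetry property (3): the sum over all K labels is the constant -(K-1)*A,
# regardless of the logits (here: 16.0).
assert np.isclose(total, -(K - 1) * A)
```

Since ℓ_rce(f, y) reduces to −A(1 − f_y), summing over all y gives −A(K − 1) for any classifier output, which is exactly the constant required by Property (3).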

5.1. SETUP

Datasets & Baselines. We evaluate SUNGEN on eight text classification tasks: IMDb (Maas et al., 2011), SST-2 (Socher et al., 2013), Rotten Tomatoes (Pang & Lee, 2005), Amazon (McAuley & Leskovec, 2013), Yelp (Zhang et al., 2015), Subj (Pang & Lee, 2004), AGNews (Zhang et al., 2015) and DBpedia (Zhang et al., 2015). These tasks have varying numbers of classes, ranging from 2 to 14. Other dataset details are in Appendix F. We compare our proposed method with the following baselines: (1) PROMPTING, the prompt-based zero-shot classification method based on PLMs (Brown et al., 2020; Gao et al., 2021b); (2) ZEROGEN, a recent zero-shot learning approach via dataset generation (Ye et al., 2022a).

Implementation Details. We compare the baselines using GPT2-XL (Radford et al., 2019) as the PLM. For text generation, we use Nucleus Sampling (Holtzman et al., 2020) with p = 0.9 as the decoding strategy and GPT2-XL as the generator. For a fair comparison, we use the best prompts designed by Ye et al. (2022a) in both the PROMPTING and data-generation settings (Appendix Table 10). For task model training, we use a 1-layer Bi-LSTM and DistilBERT-base as the lightweight classifiers. The bilevel procedure is iterated 50 times for each task. For more details (e.g., full prompts, training details), please refer to Appendix F.
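Since Nucleus Sampling is central to the generation setup, here is a minimal sketch of the top-p filtering step it performs at each decoding position. It is illustrative only: it operates on a full probability vector rather than the streaming logits a real decoder would use.

```python
import numpy as np

def nucleus_filter(probs, p=0.9):
    """Top-p (nucleus) filtering sketch: keep the smallest set of tokens whose
    cumulative probability reaches p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]          # token ids sorted by probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # smallest prefix with mass >= p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
out = nucleus_filter(probs, p=0.9)           # keeps only the top-3 tokens
```

With p = 0.9, the two lowest-probability tokens here fall outside the nucleus and receive zero probability, truncating the unreliable tail of the distribution.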

5.2. EXPERIMENTAL RESULTS

Main Experiments. We present our main experimental results in Table 1. Our SUNGEN achieves considerable performance gains over the ZEROGEN baseline across all tasks. Interestingly, the improvement is more prominent for LSTM than for DistilBERT: SUNGEN-LSTM shows a 9.8% average relative improvement over ZEROGEN across all tasks. We conjecture that pre-trained models such as DistilBERT are inherently more robust to noisy training data, as also pointed out by Hendrycks et al. (2019). Surprisingly, on Rotten Tomatoes and Yelp, SUNGEN-LSTM even outperforms ZEROGEN-DistilBERT, despite the LSTM requiring no pre-training and having far fewer parameters.

Effectiveness of SUNGEN in Constructing Noise-Free Data. Figure 1 shows that while ZeroGen suffers from overfitting on noisy data, the problem disappears when training on the subset selected by SUNGEN. With longer training, the data selected by SUNGEN consistently improves the TAM's test performance and significantly surpasses ZeroGen, which demonstrates that SUNGEN can effectively construct noise-free, high-quality data.

Synthetic Data vs. Gold Data Supervision. A key advantage of our proposed framework is that it does not require gold data to optimize the sample weights. In Table 2, we compare SUNGEN to a bilevel re-weighting method that uses gold data for the outer objective. Specifically, SUNGEN calculates the robust outer objective ℓ_rce on noisy synthetic data, while the counterpart calculates the standard outer objective ℓ_ce on gold data. Table 2 shows that our method achieves performance similar to the counterpart, verifying that a noisy validation set can supervise the optimization of sample weights as effectively as a clean validation set.

SUNGEN vs. Popular Denoising Baselines.
We compare with other denoising methods: (1) directly removing data with low predicted confidence (Swayamdipta et al., 2020); (2) Co-Teaching (Han et al., 2018), which identifies noisy data by large loss values; and (3) Meta-Weight-Net (Shu et al., 2019), which relies on meta samples and a loss-based proxy model to learn the sample weights. Since in the zero-shot setting we have no access to gold data, we use synthetic data as the validation set for Co-Teaching and Meta-Weight-Net. The results in Table 3 demonstrate our method's superiority in the zero-shot setting. We further analyze in Appendix L why loss-value-based methods fail to distinguish noisy data in our situation.

5.3. ABLATION STUDY AND ANALYSIS

Performance with Different Data Sizes. Table 4 shows that even with much less data, a model trained with SUNGEN achieves better performance than ZEROGEN. For example, on IMDb, after removing low-quality data, a subset with 2% of the data achieves better performance than the original dataset (20k vs. 1,000k samples), demonstrating the superior data quality of SUNGEN. Besides, Figure 3 (d) shows that the percentage of erroneous data is small. However, these erroneous data significantly degrade model performance: without weighted training to remove the effect of noisy data, the IMDb accuracy on the full set drops from 86.56 to 78.29.

Analysis of Data Diversity and Correctness. Table 5 shows that our selected noise-free dataset is more diverse than the data generated by ZEROGEN. Interestingly, on Amazon and Yelp, the average correctness of SUNGEN is slightly lower than that of ZEROGEN. In addition, the top samples (those with the highest weights) have slightly lower correctness than average. This is expected, as the data with the highest correctness may be redundant or too simple, while challenging and informative samples are often harder to classify and thus have lower correctness. The results further verify that SUNGEN effectively separates informative samples from redundant or erroneous ones. This cannot be done with heuristic methods that manually set a threshold to separate clean and noisy data, which may keep redundant samples and remove hard ones.

Analysis of Removed and Selected Examples. We take IMDb (a sentiment classification task over movie reviews) as the example task and find that the majority of samples associated with small weights are wrongly labeled, as shown in Table 6. Besides, there is a small portion of erroneous data containing task-irrelevant text, meaning the generated text is not a movie review and has no obvious emotional tendency.
For samples with large weights, we find them to be well-written, complex sentences with transitions (bottom part of Table 6), which verifies that SUNGEN tends to select correct and important samples.

Removed Data (Small Weights):
- "The film does a great job of capturing the fear and battle that so many U.S. troops have experienced during the seven-year war in Afghanistan." (pseudo label: Neg.; noise type: corrupted label)
- "This long, pompous, chain-smoking movie makes a big hit out of not very much at all." (pseudo label: Pos.; noise type: corrupted label)
- "One of the worst cult films ever made. 2D CGI animations of zombies and comic-book characters. Some bad acting, technical problems, cheap gimmicks and script that is." (pseudo label: Pos.; noise type: corrupted label)
- "The four-hour look at family dysfunction brings forth a richer character study." (pseudo label: Neg.; noise type: task-irrelevant)
- "A fresh and visceral portrait of a man trying to make sense of his life." (pseudo label: Neg.; noise type: task-irrelevant)

Selected Data (Large Weights):
- "Despite its oddball structure and topsy-turvy interactions between characters, this surprisingly zany animated film succeeds where so many animated films fail." (pseudo label: Pos.; no noise)
- "Not worth the time for the main actors, but for the almost movie has a very good story that puts many sci-fi movies of the past to shame." (pseudo label not recovered; no noise)
- "An outrageously stupid movie that has been thoroughly disproved by fact and logic." (pseudo label: Neg.; no noise)
- "Wonder Woman is a big-budget superhero blockbuster that turns the spotlight on the potential of a woman leader. . . but the movie is ultimately unfulfilling and laden with female stereotypes." (pseudo label: Neg.; no noise)
- "While a satire on class and power, the film's dismissal of human misery is shallow and the actors' portrayal of the weak and needy are gratingly self-pitying." (pseudo label: Neg.; no noise)

6. RELATED WORK

Zero-shot Learning via PLMs. The popular prompt-based zero-shot prediction was proposed by GPT (Radford et al., 2019). With well-designed prompts, large-scale PLMs have shown notable zero-shot learning ability in various tasks (Jiang et al., 2020; Shin et al., 2020; Reynolds & McDonell, 2021). More recently, data-generation-based work via PLMs has gained popularity and shown a superior ability to synthesize task-specific data (Anaby-Tavor et al., 2020; Puri et al., 2020; Kumar et al., 2020; Lee et al., 2021; Wang et al., 2021b; Yoo et al., 2021; Bonifacio et al., 2022). Apart from work that still relies on task-related human-annotated data to instruct or fine-tune the generative PLM, a recent line of research explores this direction in the zero-shot scenario: Schick & Schütze (2021) and Meng et al. (2022) use a PLM with task-dependent prompts to generate data, and fine-tune another PLM on such data for task prediction. To further investigate PLMs' zero-shot ability and alleviate the computational cost of PLMs, Ye et al. (2022a) study an extreme scenario that trains a tiny model from scratch using the synthetic data.

Noise-Robust Learning. Previous methods for tackling data noise can be categorized into two groups. (1) Heuristic approaches based on loss values, which rely on the assumption that the network learns easy samples first and adopt either resampling (Han et al., 2018; Jiang et al., 2018; Yu et al., 2019), loss re-weighting (Thulasidasan et al., 2019; Konstantinov & Lampert, 2019; Ren et al., 2018; Shu et al., 2019), or label correction (Ma et al., 2018; Kremer et al., 2018; Reed et al., 2014). These methods require either manually setting a threshold for the loss value or a clean validation set, which makes their performance questionable in the zero-shot scenario. (2) Methods in another line train the network with a noise-robust loss (Ghosh et al., 2017; Ma et al., 2020; Liu & Guo, 2020; Xu et al., 2019; Wang et al., 2019).
Although they learn a robust classifier in theory, such losses typically make DNNs difficult to train and require more hyper-parameter tuning (Zhang & Sabuncu, 2018; Wang et al., 2019). To this end, we take advantage of both lines of research and design an end-to-end framework that can reliably filter out harmful data without requiring a clean validation set.

7. CONCLUSION

This paper focuses on high-quality data generation in efficient zero-shot learning. To address the noise in the data, we design an end-to-end framework that constructs a clean synthetic dataset without relying on any human-labeled data or human intervention. Our method can be jointly applied to other data-generation pipelines to automatically select high-quality data. We hope this paper provides insights for improving the quality of synthetic datasets and inspires further exploration of data-generation-based zero-shot learning via PLMs.

A.1.1 OPTIMIZATION DIFFICULTY OF NOISE-ROBUST LOSS

First, we analyze the causes of the difficulty in optimizing robust losses. (1) Zhang & Sabuncu (2018) show that the gradients of the cross-entropy loss contain an implicit weighting term (Eqn. 5), which prioritizes harder samples during training. This weighting does not exist in noise-robust loss functions, which treat all samples equally; the lack of implicit weighting is claimed by Zhang & Sabuncu (2018) to be the cause of the training difficulty:

Σ_{i=1}^n ∂ℓ(f(x_i; θ), y_i)/∂θ =
  Σ_{i=1}^n −(1 / f_{y_i}(x_i; θ)) ∇_θ f_{y_i}(x_i; θ)   for CE,
  Σ_{i=1}^n −∇_θ f_{y_i}(x_i; θ)                          for MAE/RCE.   (5)

(2) Our experiment shows that the surface of the robust loss has a wide flat region, so when the parameters are not close to the optimal solution, the gradients can vanish, leading to difficulty in optimization. A 3D visualization of the loss surface is shown in Figure 4; details on producing the figure are given in A.1.2.

Second, we verify that the performance degradation when training the network with a robust loss is caused by optimization difficulty, rather than by the lack of a good solution. Without loss of generality, we compare the robust loss ℓ_rce to ℓ_ce. More specifically, we train the model with one loss, and use the other loss to evaluate the model along its training trajectory.
From the results in Figure 5(a), we can clearly see that when optimizing the network with CE, both the CE and RCE losses decline throughout training and reach a plateau at the same time, which means that the RCE loss does indeed have a good solution. On the other hand, Figure 5(b) shows that when training with RCE, neither CE nor RCE decreases appreciably, which further verifies that RCE is difficult to optimize.

A.1.2 THE OPTIMAL SOLUTIONS OF CE AND RCE ARE CLOSE

Despite the optimization difficulty of the RCE loss, the loss surfaces of CE and RCE indicate that the two loss functions have close optimal solutions. More specifically, we plot the loss surfaces of CE and RCE centered around the solution of CE, following Li et al. (2018), who visualize loss surfaces by perturbing the model weights in two random directions:

$$f(\alpha, \beta) = L(\theta^* + \alpha\delta + \beta\eta),$$

where θ* denotes the model parameters optimized by CE, δ and η are random directions in the vector space of model parameters, and α and β are scaling coefficients of the two directions, which control how far the parameters are perturbed. We parameterize the surface with α, β ranging over [-1, 1] and calculate the loss values at positions along a 2D grid. We show the visualization in Figure 4. From sub-figure 4(d), we can see that the optimal solution of CE is also close to the optimal solution of RCE (which likewise attains its lowest value near zero), which is direct experimental support for Assumption 3. In addition, when the solution is not close to the optimal position, the RCE loss surface is flat and wide, which is the cause of the optimization difficulty.
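The surface construction f(α, β) = L(θ* + αδ + βη) can be sketched in a few lines (a minimal illustration; the toy quadratic loss and all names are ours, standing in for the trained network's loss):

```python
import numpy as np

def loss_surface(loss_fn, theta_star, alphas, betas, seed=0):
    """Evaluate f(a, b) = L(theta* + a*delta + b*eta) on a 2D grid,
    with delta and eta drawn as random directions in parameter space
    (following Li et al., 2018)."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(theta_star.shape)
    eta = rng.standard_normal(theta_star.shape)
    return np.array([[loss_fn(theta_star + a * delta + b * eta)
                      for b in betas] for a in alphas])

# toy stand-in loss whose optimum is at theta = 0
surface = loss_surface(lambda t: float(np.sum(t ** 2)),
                       theta_star=np.zeros(4),
                       alphas=np.linspace(-1, 1, 5),
                       betas=np.linspace(-1, 1, 5))
# the grid centre (alpha = beta = 0) attains the minimum, i.e. theta*
```

Plotting `surface` as a 3D mesh over (α, β) reproduces the kind of visualization shown in Figure 4.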

A.2 PROOF FOR PROPERTY 1

The noise-robust property of robust loss functions has been proven in previous work (Ghosh et al., 2017; Wang et al., 2019); we include the proofs here for completeness. Specifically, we consider three cases of label noise following (Ghosh et al., 2017): uniform noise, simple non-uniform noise, and class-dependent noise.

Uniform Noise. Given a classification problem with K classes and a loss function ℓ_robust satisfying Eqn. (3), ℓ_robust is noise-tolerant under uniform label noise if the noise rate η < (K-1)/K. Let ỹ denote the noisy label from S_syn. The proof is as follows:

$$\begin{aligned} L_{robust}(\theta, S_{syn}) &= \mathbb{E}_{x,\tilde y}\big[\ell_{robust}(f(x;\theta), \tilde y)\big] \\ &= \mathbb{E}_{x,y}\,\mathbb{E}_{\tilde y|y,x}\big[\ell_{robust}(f(x;\theta), \tilde y)\big] \\ &= \mathbb{E}_{x,y}\Big[(1-\eta)\,\ell_{robust}(f(x;\theta), y) + \frac{\eta}{K-1}\sum_{j\neq y}\ell_{robust}(f(x;\theta), j)\Big] \\ &= \mathbb{E}_{x,y}\Big[\frac{K-1-K\eta}{K-1}\,\ell_{robust}(f(x;\theta), y)\Big] + \frac{\eta C}{K-1} \\ &= \frac{K-1-K\eta}{K-1}\,L_{robust}(\theta, S_{clean}) + \frac{\eta C}{K-1}, \end{aligned}$$

where C is a constant due to the property of symmetric loss functions (Σ_{j≠y} ℓ_robust(f(x;θ), j) = C − ℓ_robust(f(x;θ), y)). Suppose θ* is the optimal solution for the clean dataset S_clean. Then for any θ we have L_robust(θ*, S_syn) − L_robust(θ, S_syn) = ((K−1−Kη)/(K−1)) (L_robust(θ*, S_clean) − L_robust(θ, S_clean)) ≤ 0. Therefore, θ* is also the optimal solution for S_syn.

Class-Dependent Noise. For a loss function ℓ_robust satisfying Eqn. (3) with 0 ≤ ℓ_robust(f(x;θ), i) ≤ C/(K−1) for all i ∈ [K], and supposing min_θ L_robust(θ, S_clean) = 0, ℓ_robust is noise-tolerant under class-dependent label noise if η_ij < 1 − η_i for all j ≠ i, i, j ∈ [K], where η_ij denotes the probability of class i being mislabelled as class j. The proof is as follows:

$$\begin{aligned} L_{robust}(\theta, S_{syn}) &= \mathbb{E}_{x,\tilde y}\big[\ell_{robust}(f(x;\theta), \tilde y)\big] \\ &= \mathbb{E}_{x,y}\,\mathbb{E}_{\tilde y|y,x}\big[\ell_{robust}(f(x;\theta), \tilde y)\big] \\ &= \mathbb{E}_{x,y}\Big[(1-\eta_y)\,\ell_{robust}(f(x;\theta), y) + \sum_{j\neq y}\eta_{yj}\,\ell_{robust}(f(x;\theta), j)\Big] \\ &= \mathbb{E}_{x,y}\Big[(1-\eta_y)\Big(C - \sum_{j\neq y}\ell_{robust}(f(x;\theta), j)\Big) + \sum_{j\neq y}\eta_{yj}\,\ell_{robust}(f(x;\theta), j)\Big] \\ &= C\,\mathbb{E}_{x,y}(1-\eta_y) - \mathbb{E}_{x,y}\sum_{j\neq y}(1-\eta_y-\eta_{yj})\,\ell_{robust}(f(x;\theta), j). \end{aligned}$$

Denote θ* = arg min_θ L_robust(θ, S_clean) and θ̃* = arg min_θ L_robust(θ, S_syn), respectively.
Given the above result, we have the following inequality for θ̃* and θ*:

$$L_{robust}(\tilde\theta^*, S_{syn}) - L_{robust}(\theta^*, S_{syn}) = \mathbb{E}_{x,y}\Big[\sum_{j\neq y}(1-\eta_y-\eta_{yj})\big(\ell_{robust}(f(x;\theta^*), j) - \ell_{robust}(f(x;\tilde\theta^*), j)\big)\Big] \le 0.$$

Since ℓ_robust is always non-negative and L_robust(θ*, S_clean) = 0, we have ℓ_robust(f(x;θ*), y) = 0 for all x, and thus ℓ_robust(f(x;θ*), i) = C/(K−1) for all i ≠ y by the symmetric property of ℓ_robust. Given the assumption on the label noise, 1 − η_y − η_yj > 0. Therefore, for the equality to hold, we must have ℓ_robust(f(x;θ̃*), i) = C/(K−1) for all i ≠ y. By the symmetric property of ℓ_robust, it follows that ℓ_robust(f(x;θ̃*), y) = 0 for all x, which means that θ̃* achieves zero loss on all clean samples as well. Therefore, θ̃* is also a minimizer of L_robust(θ, S_clean).

Non-Uniform Noise. For a loss function ℓ_robust satisfying Eqn. (3), and supposing min_θ L_robust(θ, S_clean) = 0, ℓ_robust is noise-tolerant under non-uniform label noise if η_x < (K−1)/K for all x. The proof is as follows:

$$\begin{aligned} L_{robust}(\theta, S_{syn}) &= \mathbb{E}_{x,\tilde y}\big[\ell_{robust}(f(x;\theta), \tilde y)\big] \\ &= \mathbb{E}_{x,y}\,\mathbb{E}_{\tilde y|y,x}\big[\ell_{robust}(f(x;\theta), \tilde y)\big] \\ &= \mathbb{E}_{x,y}\Big[(1-\eta_x)\,\ell_{robust}(f(x;\theta), y) + \sum_{j\neq y}\frac{\eta_x}{K-1}\,\ell_{robust}(f(x;\theta), j)\Big] \\ &= \mathbb{E}_{x,y}\,(1-\eta_x)\,\ell_{robust}(f(x;\theta), y) + \mathbb{E}_{x,y}\,\frac{\eta_x}{K-1}\big(C - \ell_{robust}(f(x;\theta), y)\big) \\ &= \mathbb{E}_{x,y}\,\frac{\eta_x C}{K-1} + \mathbb{E}_{x,y}\Big[\Big(1 - \frac{K\eta_x}{K-1}\Big)\ell_{robust}(f(x;\theta), y)\Big]. \end{aligned}$$

Therefore, we have the following inequality for any θ: L_robust(θ*, S_syn) − L_robust(θ, S_syn) = E_{x,y}[(1 − Kη_x/(K−1))(ℓ_robust(f(x;θ*), y) − ℓ_robust(f(x;θ), y))]. Since ℓ_robust is always non-negative and L_robust(θ*, S_clean) = 0, we have ℓ_robust(f(x;θ*), y) = 0 for all x, so each term is non-positive. Therefore, θ* is also the optimal solution for S_syn.
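The symmetry condition these proofs rely on, Σ_{j=1}^K ℓ_robust(f(x;θ), j) = C for every prediction f, can be checked numerically. The sketch below is our own; A = −4 is one common truncation value for log 0 in RCE, not a constant fixed by the paper.

```python
import numpy as np

A = -4.0  # truncation used for log(0) in reverse cross-entropy

def mae(f, y):
    """Mean absolute error against the one-hot label: equals 2 * (1 - f[y])."""
    onehot = np.eye(len(f))[y]
    return np.abs(onehot - f).sum()

def rce(f, y):
    """Reverse cross-entropy with log(0) clipped to A: equals -A * (1 - f[y])."""
    log_q = np.full(len(f), A)
    log_q[y] = 0.0
    return -(f * log_q).sum()

K = 4
f = np.random.default_rng(0).dirichlet(np.ones(K))  # arbitrary softmax output
# summing either loss over all K labels gives a constant, independent of f
print(sum(mae(f, y) for y in range(K)))  # 2 * (K - 1) = 6.0
print(sum(rce(f, y) for y in range(K)))  # -A * (K - 1) = 12.0
```

The constant C is 2(K−1) for MAE and −A(K−1) for truncated RCE, regardless of the prediction f.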

A.3 PROOF FOR THEOREM 1

Proof. We first show that θ* exists in the space induced by θ̂(w); we then show that our framework uniquely returns θ*. Step 1: Existence. By Assumption 1, we know that P_syn(x, y)w*(x, y) = P_clean(x, y) in distribution, so minimizing the w*-weighted loss on S_syn is equivalent to minimizing the loss on S_clean, which gives θ̂(w*) = θ*. Step 2: Uniqueness. By Assumption 2, for all θ ≠ θ* we have L_robust(θ*, S_clean) < L_robust(θ, S_clean), so θ* = arg min_θ L_robust(θ, S_clean). Since arg min_θ L_robust(θ, S_syn) = arg min_θ L_robust(θ, S_clean) by Property 1, we have θ* = arg min_θ L_robust(θ, S_syn). Putting the existence and uniqueness parts together finishes the proof. The above theorem shows that there is an explicit mapping between the optimal weighting function w* and the optimal θ*. This mapping is realized by the inner loop in formulation 2. We can then optimize L_ce(θ̂(w), S_clean) over w to find the optimal w*. Given that Assumption 2 holds, we can replace L_ce(θ̂(w), S_clean) with L_robust(θ̂(w), S_clean). Finally, thanks to Property 1 of ℓ_robust, we may simply optimize L_robust(θ̂(w), S_syn) over w instead. This gives rise to the outer loop in formulation 1.
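The interplay of the two loops can be illustrated end-to-end on a toy problem. This is entirely our own construction, not the paper's implementation: weighted least squares stands in for the weighted-CE inner loop, MAE on the same synthetic set stands in for the robust outer loss, and the hypergradient is taken by finite differences.

```python
import numpy as np

# synthetic 1D "dataset": y = 2x for clean samples; the last two are mislabelled
x = np.array([1.0, 2.0, 3.0, 4.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0, 8.0, -4.0, -6.0])

def inner(w):
    # inner loop: closed-form weighted least-squares fit theta(w)
    return np.sum(w * x * y) / np.sum(w * x * x)

def outer(w):
    # outer loop: noise-robust (MAE) loss of theta(w) on the same synthetic set
    return np.mean(np.abs(inner(w) * x - y))

w, eps = np.ones(6), 1e-5
for _ in range(200):
    grad = np.array([(outer(w + eps * np.eye(6)[i]) - outer(w)) / eps
                     for i in range(6)])
    w = np.clip(w - 0.5 * grad, 0.0, 1.0)

# the two mislabelled samples are driven to near-zero weight,
# and theta(w) recovers the clean slope of 2
print(w.round(2), round(float(inner(w)), 2))
```

Even though both objectives are computed on the noisy synthetic set, minimizing the robust outer loss over w downweights exactly the corrupted samples, mirroring the mechanism of Theorem 1.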

A.4 PROOF FOR THEOREM 2

Proof. Let Ŝ^{v,-1}_syn denote the dataset obtained by replacing any one element of Ŝ^v_syn with an arbitrary x. It is easy to see that |L_robust(θ̂*(w); Ŝ^v_syn) − L_robust(θ̂*(w); Ŝ^{v,-1}_syn)| ≤ κ/M. The equality in the first line is due to θ̂(w*) = θ*. The inequality comes from the strong convexity assumption. Combining Eqns. (7), (8) and (9), we have (μ/2)∥w* − ŵ∥²₂ ≤ ϵ. The result follows immediately by noting that ŵ is the output of our method.

C ABLATION STUDY OF OUTER OBJECTIVES

In the main paper, we use RCE as the example of a noise-robust loss to verify our framework. Here we conduct additional experiments using another noise-robust loss, Mean Absolute Error (MAE) (Ghosh et al., 2017). The results in Table 7 show that the SUNGEN framework achieves promising performance with both RCE and MAE losses, significantly surpassing the baseline methods. In contrast, if we use the standard cross-entropy (CE) loss as the outer objective, the bi-level framework achieves only marginal improvement, or even performs worse than ZEROGEN. This result is reasonable: the standard CE loss in the outer loop runs on synthetic data and therefore cannot provide accurate guidance for updating the per-sample weights. We also compare our method with the label-smoothing and temporal-ensemble noise-training strategies in Table 8. The experimental results show that our method achieves a significant improvement over these two counterparts. In addition, label smoothing and temporal ensembling are training strategies that alleviate the noise issue; they cannot be used to select a high-quality subset.

We evaluate on eight text classification tasks, including Yelp (Zhang et al., 2015), Subj (Pang & Lee, 2004), AGNews (Zhang et al., 2015) and DBpedia (Zhang et al., 2015). IMDB, SST-2, and Rotten Tomatoes are sentiment classification benchmarks containing positive/negative movie reviews. Amazon and Yelp are review classification tasks consisting of electronic product reviews and restaurant reviews respectively; we choose electronics and restaurant reviews because they differ substantially from movie reviews. Subj is a subjectivity detection task that judges whether a text contains factual content or expresses opinions. AGNews (4-class classification) and DBpedia (14-class classification) are topic and ontology classification tasks respectively. Apart from AGNews and DBpedia, all tasks are binary classification tasks. We use the full test set for evaluation except for DBpedia, for which we randomly sample 5,000 test examples to reduce computational cost.
Sample sizes are listed in Table 1 . We report accuracy for evaluation.

F.2 FULL IMPLEMENTATION DETAILS

We compare the baselines using GPT2-XL (Radford et al., 2019) as the PLM. For text generation, we use Nucleus Sampling (Holtzman et al., 2020) with p = 0.9 as the decoding strategy and GPT2-XL as the generator. For a fair comparison, we use the best prompts designed by Ye et al. (2022a) for data generation. During the optimization of sample weights, we use the Adam optimizer. We select the outer learning rate from {2.5e-1, 1e-1, 1e-2} by monitoring the RCE loss in the outer loop: if the outer loss steadily decreases and reaches a low value, the optimization is proceeding well. In the inner loop, 1,000k synthetic samples are used as the training data; in the outer loop, 50k synthetic samples are randomly sampled as the training data for fast iteration. We use a 1-layer Bi-LSTM and DistilBERT-base as the tiny task models and run each for 1 epoch per iteration for fast iteration. The bilevel procedure is iterated 50 times for each task. For task model training, we use the 1-layer Bi-LSTM and DistilBERT-base as lightweight classifiers. For the LSTM, we use the Adam optimizer (Kingma & Ba, 2015) with learning rate 1e-3. For DistilBERT-base, we finetune on each dataset using the Adam optimizer with learning rate 2e-5 and the other default hyper-parameters suggested by the HuggingFace Transformers library (Wolf et al., 2019). We run the LSTM for 5 epochs and DistilBERT-base for 3 epochs for prediction. Unless otherwise stated, we run our experiments on 200k data. We report the average test accuracy over 3 runs with different random seeds. For the baseline using gold data in Table 2, to simulate the scenario where gold data is scarce, we randomly select 1,000 samples from the standard training set as the outer-loop training data. For comparison with the other denoising baselines shown in Table 3, we use the techniques as described in the original papers.
Specifically, for Confidence (Swayamdipta et al., 2020), we use the mean model probability of the true label across epochs as the confidence value and select the top 200k examples; for Co-teaching (Han et al., 2018), we use two networks, each trained on samples selected by the other network based on loss values; for Meta-Weight-Net (Shu et al., 2019), since we do not have access to clean data, we use a portion of the synthetic data as the validation set. Co-teaching and Meta-Weight-Net are conducted on 200k synthetic data.

H TRUNCATED BACK-PROPAGATION FOR META GRADIENT

To solve the bilevel optimization problem, the gradient with respect to w can be calculated as follows:

$$\nabla_w L_{robust} = \nabla_\theta L_{robust}\big|_{\theta_T}\, \nabla_w \hat\theta(w) \tag{10}$$

$$= \nabla_\theta L_{robust}\big|_{\theta_T} \sum_{j\le T}\Big[\prod_{k<j}\Big(I - \frac{\partial^2 L_{ce}}{\partial\theta\,\partial\theta^\top}\Big|_{\theta_{T-k-1}}\Big)\Big] \frac{\partial^2 L_{ce}}{\partial\theta\,\partial w^\top}\Big|_{\theta_{T-j-1}} \approx \nabla_\theta L_{robust}\big|_{\theta_T}\, \frac{\partial^2 L_{ce}}{\partial\theta\,\partial w^\top}\Big|_{\theta_{T-1}}, \tag{11}$$

where Eqn. (10) follows from the chain rule. For computational efficiency, we do not unroll the entire T steps, but perform 1-step truncated back-propagation as in Eqn. (11) (Shaban et al., 2019).
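For intuition, with an SGD inner step θ_T = θ_{T−1} − α ∇_θ L_ce(θ_{T−1}, w) and a weighted inner loss L_ce = Σ_i w_i ℓ_i, we have ∂²L_ce/∂θ∂w_i = ∇_θ ℓ_i, so the 1-step truncated meta-gradient reduces to −α ⟨∇_θ L_robust(θ_T), ∇_θ ℓ_i(θ_{T−1})⟩. The scalar toy check below is our own construction (and is exact for T = 1):

```python
import numpy as np

targets = np.array([1.0, 3.0, 10.0])   # per-sample inner targets; last is an outlier
robust_target = 2.0                    # target of the outer (robust) loss
lr = 0.1

def inner_grads(theta):
    # gradient of the per-sample inner loss l_i = 0.5 * (theta - t_i)^2
    return theta - targets

def theta_after_one_step(theta_prev, w):
    # one SGD step on the w-weighted inner loss
    return theta_prev - lr * np.sum(w * inner_grads(theta_prev))

def truncated_hypergrad(theta_prev, w):
    # 1-step truncated meta-gradient:
    #   dL_robust/dw_i ≈ -lr * L_robust'(theta_T) * l_i'(theta_{T-1})
    theta_T = theta_after_one_step(theta_prev, w)
    return -lr * (theta_T - robust_target) * inner_grads(theta_prev)

g = truncated_hypergrad(theta_prev=2.0, w=np.ones(3))
# the outlier (target 10) gets the largest positive meta-gradient,
# so a gradient step on w pushes its weight down the hardest
print(g)   # approximately [-0.08, 0.08, 0.64]
```

Because only one inner step is unrolled here, the truncated gradient coincides with the exact hypergradient, which can be verified by finite differences.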

I COMPARISON WITH OTHER NOISE-ROBUST LEARNING METHODS

Our framework has the following advantages over other noise-robust learning methods:

• Compared with heuristic methods (Han et al., 2018; Jiang et al., 2018), our framework is end-to-end and does not require excessive manual tuning or task-specific knowledge.

• Since both the inner and outer objectives are computed on the same synthetic training set, we do not need any in-domain labeled data, which is a must in previous end-to-end re-weighting methods (Ren et al., 2018; Shu et al., 2019).

• Compared with methods that train the model directly with ℓ_robust (Zhang & Sabuncu, 2018; Wang et al., 2019), our approach leverages ℓ_robust to learn the sample weights, which enables removing the low-quality data without hurting the model's performance.

J MORE RELATED WORK

Bilevel Optimization. Bilevel optimization (BO) (Sinha et al., 2017), which our algorithm builds upon, has received much attention recently. This optimization technique is able to tackle problems with hierarchical structures. BO has been successfully adopted in numerous applications, such as hyper-parameter optimization Lorraine et al. (2020

K SELECTED AND REMOVED DATA

The selected and removed examples are listed in Table 6, with IMDb as the example task. We observe that most of the removed data (data with low weights) have noisy labels, i.e., the class of the text was wrongly labeled by the PLM during generation (Noisy Y). Besides, a small portion of the erroneous data contains text unrelated to the task, meaning the generated text is not a movie review with an obvious emotional tendency (Unrelated X). From Figure 3(d), we find that the percentage of erroneous data is small, but it significantly degrades model performance (e.g., IMDB accuracy drops from 86.56 to 78.29 in Table 4). The data selected by SUNGEN are mostly well-written, complex sentences with transitions (bottom part of Table 6), which verifies that SUNGEN tends to select correct and challenging samples. We further empirically analyze why the popular loss-value-based methods (Shu et al., 2019; Han et al., 2018) fail to select a noise-free dataset in our scenario. For this experiment, we manually construct a dataset with 30% label noise from the SST-2 training set by randomly flipping class labels. As shown in Figure 6, the loss values of the correctly labelled data and the mislabelled data remain clustered together. We conjecture that this is due to the nature of the task: not all tasks exhibit the phenomenon that noisy data have higher loss values, as is also noted in work on instance-dependent noise learning (Cheng et al., 2020). Therefore, selecting a subset based on loss values is not applicable in our scenario.
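This failure mode can be reproduced in a few lines. The construction below is our own illustration with synthetic overlapping classes (the paper's experiment uses SST-2): after flipping 30% of labels, the per-sample loss distributions of clean and noisy data overlap heavily, so no loss threshold isolates the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
x = rng.normal(loc=(2 * y - 1) * 0.2, scale=1.0)   # heavily overlapping classes
noisy = rng.random(n) < 0.3                        # flip 30% of the labels
y_obs = np.where(noisy, 1 - y, y)

# per-sample cross-entropy under a fixed logistic scorer p(y=1|x) = sigmoid(x)
p1 = 1.0 / (1.0 + np.exp(-x))
ce = -np.where(y_obs == 1, np.log(p1), np.log(1.0 - p1))

# fraction of CLEAN samples whose loss exceeds the median NOISY loss:
# far from zero, so loss-based filtering removes many clean samples
overlap = np.mean(ce[~noisy] > np.median(ce[noisy]))
print(round(float(overlap), 2))
```

When class distributions overlap, hard clean samples and mislabelled samples are indistinguishable by loss value alone, which is exactly what Figure 6 shows for SST-2.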



Figure 1: Training and testing accuracy of the LSTM model trained on the synthetic dataset. After training for more epochs, the testing performance of ZEROGEN deteriorates significantly, indicating that the model starts to fit the erroneous data.

Figure 2: The framework of SUNGEN. Our bi-level framework learns sample weights w measuring data quality without relying on any human-annotated data. In the inner loop, we train a tiny task model (TAM) with weighted CE loss based on current sample weights, and produce trained TAM parameters θ(w); in the outer loop, we adopt a noise-robust loss to guide the learning of w by evaluating θ(w) on a synthetic validation set.

We normalize the learnt weights to obtain inclusion probabilities w′_i (with Σ_{i=1}^n w′_i = D), based on which we then sample a Bernoulli random variable I_i ∼ Ber(w′_i) for each sample to indicate whether it should be included. Denote the size of the sampled subset as D̂. Because E_{I∼p(I|w)} ∥I∥₀ = Σ_{i=1}^n w′_i = D, the realised size D̂ concentrates around D, meaning that D̂ is close to D. Notably, because the important and noise-free data are associated with larger weights, training on the sampled subset can perform on par with weighted training over the entire dataset, while being much more efficient.
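A minimal sketch of this sampling step (the normalisation used to turn learnt weights into inclusion probabilities is our own assumption):

```python
import numpy as np

def sample_subset(weights, target_size, seed=0):
    """Rescale weights so their sum equals the target subset size, then draw
    an independent Bernoulli indicator I_i ~ Ber(w'_i) per sample. Before
    clipping, the expected subset size is exactly target_size."""
    rng = np.random.default_rng(seed)
    probs = np.clip(weights * target_size / weights.sum(), 0.0, 1.0)
    return np.flatnonzero(rng.random(len(weights)) < probs)

w = np.random.default_rng(1).random(100_000)  # stand-in for learnt weights
idx = sample_subset(w, target_size=20_000)
print(len(idx))  # concentrates tightly around 20,000
```

Since the indicators are independent, the subset size has standard deviation at most sqrt(D), so for D = 20,000 the realised size deviates from D by only a few hundred at most.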

Figure 3: Histogram of learnt weights in IMDb synthetic dataset (1,000k). The weights are gradually separated as optimization proceeds, indicating SUNGEN can differentiate high-quality data from erroneous ones.

Figure 5: Loss curves of model training. Experiments run on the IMDb standard training set. The loss used to train the network is in red, and the loss used to evaluate the network is in grey. We ran experiments with various learning rates from {1e-2, 1e-3, 1e-4, 1e-5}; the curves show similar trends.

$$\begin{aligned} \hat\theta(w^*) &= \arg\min_\theta L_{robust}(\theta, S_{syn}(w^*)) \\ &= \arg\min_\theta \int w^*(x, y)\,\ell(f(\theta, x), y)\,P_{syn}(x, y)\,dx\,dy \\ &= \arg\min_\theta \mathbb{E}_{P_{syn}(x,y)w^*(x,y)}\,\ell(f(\theta, x), y) \\ &= \arg\min_\theta \mathbb{E}_{P_{clean}(x,y)}\,\ell(f(\theta, x), y) \\ &= \theta^*. \end{aligned}$$

); Maclaurin et al. (2015); MacKay et al. (2019); Franceschi et al. (2017); Vicol et al. (2021), neural architecture search Pham et al. (2018); Liu et al. (2018); Pham et al. (2018); Shi et al. (2020); Yao et al. (2021); Gao et al. (2022; 2021a); Shi et al. (2021), meta learning Finn et al. (2017); Nichol & Schulman (2018), dataset condensation Wang et al. (2018); Zhao et al. (2020); Cazenavette et al. (2022); Pi et al. (2022) and sample re-weighting Ren et al. (2018); Shu et al. (2019); Zhou et al. (2022a;b).

(a) Epoch 0 (54%). (b) Epoch 1 (61%). (c) Epoch 2 (63%). (d) Epoch 3 (60%). (e) Epoch 4 (62%).

Figure 6: Loss histogram of SST-2 with 30% uniform label noise. Clean data and noisy data are marked in green and red respectively. The accuracy is listed in parentheses. We observe that the loss value cannot separate the clean data from the noisy data; therefore, loss-value-based methods may not work well in our case.

Evaluation results for the SUNGEN framework on two different scales of TAM. The scale of the synthetic dataset is 200k for both ZEROGEN and SUNGEN. The scales of labeled data in the supervised setting are listed under the task names. "Gold Data" refers to the standard dataset with human annotations.

Evaluation results using different validation sets (S^v) in the outer loop. LSTM is used as TAM.

Experimental comparison with other de-noising methods using LSTM as TAM.

Results of SUNGEN-LSTM on different data sizes. Given subsets selected by SUNGEN, models are trained with ℓ_ce. For the 1,000k set of SUNGEN, the model is trained using weighted ℓ_ce. Subsets that surpass the original full set (1,000k) are marked in bold; the performance of the original full set is underlined.

Diversity and Correctness. We measure diversity by Self-BLEU4 and correctness by the accuracy of an oracle model (finetuned RoBERTa-Large). A lower Self-BLEU4 score indicates higher diversity. "SUNGEN-Top" and "SUNGEN-Bottom" denote the 10k samples with the highest and lowest weights respectively.

Examples of removed data and selected data in IMDb synthetic dataset.

Experimental comparison using different outer objectives. Since no clean validation set exists in the zero-shot setting, all the outer objectives are calculated on synthetic data. The experiments run on 200k synthetic data using LSTM as TAM.

Comparison between SUNGEN and other noise-robust training strategies. Experiments run on 200k synthetic data using LSTM as TAM.

Comparison between SUNGEN and other noise-robust training strategies. Experiments run on 20k synthetic data using DistilBERT as TAM.


Summary and Discussion. In the sections below, we first show that (1) the poor performance of training a neural network with robust losses is due to optimization difficulty, rather than the lack of a good solution. We verify this by showing that the RCE loss steadily decreases when the network is trained with the CE loss, whereas when the network is trained with RCE, the loss fails to decrease. We then show that (2) the optimal solutions of the CE and RCE losses are close, by plotting the loss surfaces of both losses around the optimum obtained with CE: the RCE loss is also close to its minimum around the optimal solution of CE. We believe these two experiments verify that Assumptions 2 and 3 are reasonable.

holds for any w. Then, by the bounded difference inequality (Corollary 2.21 of Wainwright (2019)), given w, inequality (6) holds with probability 1 − δ. We then have the following chain of inequalities: the first inequality holds because we require inequality (6) to hold uniformly over all |W| functions; the second because ŵ is the ϵ-approximate solution; the third by applying inequality (6); and the fourth because |W| > 1. Taking the infimum over w on the right-hand side, we obtain the desired bound.

B RELAXING ASSUMPTION 2

Here we provide additional results under relaxed assumptions. We relax Assumption 2 as follows:

Assumption 3. The optimal θ* achieves ϵ-optimal robust loss on the clean dataset, i.e., L_robust(θ*, S_clean) < L_robust(θ, S_clean) + ϵ for all θ ≠ θ*, with ϵ > 0.

We then have the following result, which extends Theorem 1:

Theorem 3. If Assumption 1 holds, there exists a w* such that θ̂(w*) = θ*. Further, under Assumption 3 and Property 1, and assuming L_robust(θ̂(w), S_clean) is μ-strongly convex and differentiable w.r.t. w, our method returns a ŵ that is close to w*. Moreover, if the minimizer of the robust loss on the clean data coincides with θ*, i.e., ϵ = 0, we have ŵ = w* and θ̂ = θ*.

Proof. Similar to the first step of the proof of Theorem 1 in Appendix A.3, we know θ̂(w*) = θ*. Further, by Property 1, the solution of arg min_θ L_robust(θ, S_syn) is the same as that of arg min_θ L_robust(θ, S_clean). Denote ŵ = arg min_w L_robust(θ̂(w), S_clean); it follows that

∇_w L_robust(θ̂(w), S_clean)|_{w=ŵ} = 0.   (7)

By Assumption 3, we then obtain the stated bound.

For IMDb, SST-2, and Rotten Tomatoes, we use the best prompts designed by Ye et al. (2022a) in both the PROMPTING and data-generation settings. For tasks not investigated by ZEROGEN, following Ye et al. (2022a), we manually design five prompts for each task and report the result of the best prompt for PROMPTING; we then use the same prompt (or a minor revision of it) for ZEROGEN and SUNGEN to generate data. The details of the prompts are shown in Table 10. We also empirically verify the effectiveness of the bi-level SUNGEN over one-level optimization using ℓ_rce (One-level, ℓ_rce). From the results in Table 11, we observe that our framework with bi-level ℓ_rce significantly outperforms both one-level counterparts.

