META-WEIGHTED LANGUAGE MODEL TUNING FOR AUGMENTATION-ENHANCED FEW-SHOT LEARNING

Anonymous

Abstract

Recent studies have revealed the intriguing few-shot learning ability of pretrained language models (PLMs): They can quickly adapt to a new task when fine-tuned on a small amount of labeled data formulated as prompts, without requiring abundant task-specific annotations. Despite their promising performance, most existing few-shot approaches that only learn from the small training set still underperform fully supervised training by nontrivial margins. In this work, we study few-shot learning with PLMs from a different perspective: We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large amount of novel training samples which augment the original training set. To encourage the generator to produce label-discriminative samples, we train it via weighted maximum likelihood where the weight of each token is automatically adjusted based on a discriminative meta-learning objective. A classification PLM can then be fine-tuned on both the few-shot and the synthetic samples with regularization for better generalization and stability. Our approach FewGen achieves an overall better result across seven classification tasks of the GLUE benchmark than existing few-shot learning methods, improving no-augmentation methods by 5+ average points, and outperforming augmentation methods by 3+ average points.

1. INTRODUCTION

Recent research has demonstrated the appealing few-shot learning potential of pretrained language models (PLMs) (Brown et al., 2020; Clark et al., 2020; Devlin et al., 2019; He et al., 2021; Liu et al., 2019; Meng et al., 2021) on natural language understanding (NLU) tasks (Wang et al., 2019; 2018): Instead of relying on abundant task-specific annotations, PLMs can effectively leverage a small set of training samples to quickly learn a new task. Such training data efficiency is usually achieved by formulating downstream tasks as prompts (Brown et al., 2020; Gao et al., 2021; Scao & Rush, 2021; Schick & Schütze, 2021a;d), which allow the PLM to adapt its language modeling ability acquired through pretraining to new downstream tasks. The success of prompt-based methods has stimulated numerous explorations along the line of effective few-shot learning with PLMs: The training samples converted to natural language prompts can be used to directly fine-tune PLMs (Gao et al., 2021; Schick & Schütze, 2021a) or as in-context demonstrations to facilitate better inference (Brown et al., 2020; Liu et al., 2022b). More recent approaches aim to automate the design of prompts by gradient-based searching (Shin et al., 2020) or by parameterizing prompts as continuous learnable embeddings (Lester et al., 2021; Liu et al., 2021b; Zhang et al., 2022; Zhong et al., 2021). Other studies investigate and address specific issues in prompt-based few-shot learning (Liu et al., 2022a; Tam et al., 2021; Zhao et al., 2021). While remarkable, the model performance still has a nontrivial gap from fully supervised models trained on massive labeled data. Indeed, training deep models is inherently data demanding: model generalization usually benefits from more training samples (Baum & Haussler, 1988).
In this work, we study few-shot learning with PLMs from a different perspective: Instead of proposing new methods for fine-tuning on few-shot samples, we focus on generating quality training data based on the few-shot samples and on using the synthesized samples to fine-tune the classification model. Motivated by the strong text generation power of autoregressive PLMs (Brown et al., 2020; Keskar et al., 2019; Raffel et al., 2019), previous data augmentation methods enlarge the training set by synthesizing new samples based on the few-shot samples. They either fine-tune the generator on the training set with the standard maximum likelihood objective (Anaby-Tavor et al., 2020; Kumar et al., 2020) or use the training samples as demonstrations (Yoo et al., 2021). However, these methods do not explicitly model the distinction across different labels and may struggle to generate accurate training samples pertaining to the desired labels for challenging NLU tasks. In this paper, we study how to use few-shot samples to effectively tune PLMs to generate high-quality, label-discriminative training samples. Our contributions are as follows: (1) We analyze the issues of using standard maximum likelihood for tuning the generator and propose a meta-weighted maximum likelihood objective for generator tuning by automatically learning token weights that emphasize label discriminativeness. (2) We propose a simple and effective training procedure for fine-tuning classification PLMs on generated data by mitigating label noise. (3) Under the same few-shot learning setting, our method FewGen outperforms existing methods by 3+ average points on seven classification tasks of the GLUE benchmark (Wang et al., 2018). Ablation studies demonstrate the effectiveness of our proposed meta-weighted training objective and classifier fine-tuning method.

2. RELATED WORK

Few-Shot Learning with PLMs. Few-shot learning has gained much attention recently due to its minimal resource assumption: without requiring massive annotated data but only leveraging a few training samples (e.g., 16 per label), few-shot methods can be widely adopted in many practical scenarios where obtaining large-scale annotations is unaffordable. Standard fine-tuning of PLMs for few-shot learning usually performs poorly because the limited training samples may not be sufficient for optimizing the parameters in the newly introduced classification head. To reuse the language modeling ability of PLMs without introducing randomly initialized parameters, prompt-based approaches (Brown et al., 2020; Gao et al., 2021; Hu et al., 2022; Logan IV et al., 2021; Min et al., 2022; Schick & Schütze, 2021a;b;d; Tam et al., 2021) formulate training samples as natural language prompt templates so that various downstream tasks can be solved as a token prediction problem. They enjoy improved training data efficiency over standard fine-tuning in low-data regimes (Scao & Rush, 2021) and achieve remarkable few-shot learning performance. Later developments in prompt-based methods replace the manual design of prompt templates with automatic search or learning (Cui et al., 2022; Hambardzumyan et al., 2021; Lester et al., 2021; Liu et al., 2021b; Zhang et al., 2022; Zhong et al., 2021). There are also studies focusing on specific issues in prompt-based methods, such as densifying the supervision by revising the training objective (Liu et al., 2022a; Tam et al., 2021) and calibrating the biased predictions of PLMs before fine-tuning (Zhao et al., 2021). Instead of focusing on fine-tuning methods for few-shot learning, we study how to effectively generate abundant quality training samples by learning from the few-shot samples and use them to improve the generalization of the classification model.

Data Augmentation. Data augmentation methods (Chen et al., 2020; Lee et al., 2021; Miyato et al., 2017; Xie et al., 2020) aim to create samples similar to the existing ones so that the enlarged training set can benefit model generalization. Early approaches simply use manually designed rules (e.g., swapping or inserting tokens) for word-level alterations over the given samples to create new ones (Wei & Zou, 2019). Later methods leverage the strong generation power of PLMs to synthesize novel samples from scratch. Given a training set, the PLMs can be either fine-tuned on the labeled samples to learn a label-conditioned generation probability (Kumar et al., 2020; Lee et al., 2021; Yang et al., 2020) or take the labeled data as demonstrations (Wang et al., 2021; Yoo et al., 2021) to generate similar samples pertaining to the same label. In this work, we study how to effectively tune generators on few-shot training data for creating new data: standard fine-tuning of PLMs on a small set of training data is prone to overfitting, and the resulting model may struggle to generate accurate, diverse and novel training data. We address this challenge by leveraging prefix-tuning and proposing a new meta-weighted training objective to emphasize label-discriminative tokens for generator tuning.
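The word-level alteration rules mentioned above can be sketched in a few lines; the following is a minimal illustration in the spirit of Wei & Zou (2019), where the function names and the tiny vocabulary are illustrative choices, not taken from the paper:

```python
import random

def random_swap(tokens, n_swaps=1, rng=random):
    """Swap two randomly chosen token positions n_swaps times."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_insert(tokens, vocab, n_inserts=1, rng=random):
    """Insert tokens drawn from a given vocabulary at random positions."""
    tokens = list(tokens)
    for _ in range(n_inserts):
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(vocab))
    return tokens

rng = random.Random(0)
sent = "the movie was surprisingly good".split()
print(random_swap(sent, n_swaps=1, rng=rng))
print(random_insert(sent, ["really", "quite"], n_inserts=1, rng=rng))
```

Such rules produce label-preserving perturbations cheaply, but, as noted above, they cannot synthesize genuinely novel samples the way a tuned generator can.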

Controlled Text Generation.

Generating training samples for different labels can be viewed as a form of controlled text generation (Hu et al., 2017), whose goal is to generate textual contents of desired semantics, styles or attributes. Such control can be realized through different stages of PLM training and deployment: During pretraining, control codes (Keskar et al., 2019) can be used as explicit guidance for training the model to generate domain/attribute-specific texts; fine-tuning PLMs with attribute-specific data can also grant high-level control (e.g., certain topics or sentiments (Ziegler et al., 2019)), fine-grained control (e.g., specific words or phrases (Chan et al., 2021)) or both (Khalifa et al., 2021); at inference time, control over desired attributes can also be enforced without updating the PLM parameters (Dathathri et al., 2020; Krause et al., 2021; Kumar et al., 2021; Liu et al., 2021a; Pascual et al., 2021; Yang & Klein, 2021). Recently, a few studies explore fine-tuning autoregressive PLMs (Anaby-Tavor et al., 2020; Yang et al., 2020) with the standard language modeling objective on the training set or using label-specific prompts (Meng et al., 2022; Schick & Schütze, 2021c; Wang et al., 2021; Ye et al., 2022) to steer text generation towards the desired label.

Meta-Learning for Sample Weighting. The idea of weighting training samples in the loss calculation originates from the class imbalance (Wang et al., 2017) and noisy label (Hendrycks et al., 2018) learning scenarios: by assigning higher weights to the samples from minority classes or lower weights to the noisy samples, the learning process is less impacted by the imbalance/label-noise issues. Meta-learning (Andrychowicz et al., 2016; Finn et al., 2017; Franceschi et al., 2018; Wu et al., 2018) is one way to automatically learn the weight for each sample. Specifically, a meta objective, usually defined as the loss on a clean unbiased validation set (Ren et al., 2018; Shu et al., 2019), can be used to learn the sample weights, which become hyperparameters that control the optimization of model parameters. Our work has a different motivation and formulation of the meta objective for token-wise weighted training: Not all tokens in a training sample are equally label-discriminative. We thus design a meta objective that emphasizes distinction across different labels (instead of using the validation loss as the meta objective) for learning the token weights.

3.1. PRELIMINARIES

Overview. We consider the strict few-shot learning setting (Perez et al., 2021): for each task, only a small number of labeled training samples (e.g., 16 per label) and an equally small development set are available, without access to additional task-specific unlabeled data. We denote the few-shot training set by D_train and the development set by D_dev.

Text Generation with Autoregressive PLMs. In standard fine-tuning for text generation, an autoregressive PLM G_θ is trained via the maximum likelihood generation loss of each token in a sequence x conditioned on the previous tokens:

\min_{\theta} -\frac{1}{n} \sum_{j=1}^{n} \log p_{\theta}(x_j \mid x_{<j}), \qquad p_{\theta}(x_j \mid x_{<j}) = \frac{\exp(e_j^{\top} h_j)}{\sum_{j'=1}^{|V|} \exp(e_{j'}^{\top} h_j)},

where the token generation probability p_θ(·) is parameterized using the token embeddings e and the hidden states h of a Transformer (Vaswani et al., 2017). Instead of updating all generator parameters, we adopt prefix-tuning (Li & Liang, 2021), which brings two benefits in the few-shot setting: (1) the backbone PLM parameters remain frozen, preserving the pretraining knowledge and mitigating overfitting on the small training set; (2) the generation models for different labels can share the same backbone Transformer parameters with only the prefix vectors being different, significantly reducing the memory requirement for multi-class classification tasks.
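The token-level likelihood above can be illustrated numerically; the following numpy sketch uses toy sizes (the dimensions, values, and the `lm_loss` name are illustrative, not the PLM's):

```python
import numpy as np

def lm_loss(hidden, embeddings, token_ids):
    """Average token-level NLL: -(1/n) * sum_j log p(x_j | x_<j), where
    p(x_j | x_<j) is a softmax over the vocabulary of e^T h_j."""
    logits = hidden @ embeddings.T                       # (n, |V|)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids].mean()

# Toy example: 5 positions, hidden size 8, vocabulary of 20 tokens.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))    # hidden states h_j from the Transformer
E = rng.normal(size=(20, 8))   # output token embeddings e
print(lm_loss(h, E, np.array([3, 7, 1, 0, 19])))
```

With all-zero logits the loss reduces to log |V|, the entropy of a uniform prediction, which is a convenient sanity check.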

3.2. LABEL-DISCRIMINATIVE TEXT GENERATOR TUNING WITH META WEIGHTS

Motivation. To model the conditional text generation probability p(x | y_l) for different labels, a straightforward way is to parameterize a generation model G_{θ_{p_l}} for each label y_l via a set of prefix vectors θ_p = {θ_{p_l}}_{l=1}^{L} so that p(x | y_l) = p_{θ_{p_l}}(x), and then tune θ_{p_l} on the training samples x with label y_l:

\min_{\theta_{p_l}} L_{\text{gen}}, \qquad L_{\text{gen}}(\theta_{p_l}) = -\frac{1}{n} \sum_{j=1}^{n} \log p_{\theta_{p_l}}(x_j \mid x_{<j}). \tag{1}

However, such an approach only optimizes the generative likelihood p(x | y_l) without accounting for the label discriminativeness p(y_l | x), which is essential for generating unambiguous training samples that benefit the final classification task. Challenging NLU tasks can have largely similar distributions across different labels, with very nuanced differences reflected by a few key tokens. For example, a negative review "a movie where the ending feels like a cop-out" may immediately become a positive one by just changing the last word "cop-out" to "revelation". Indeed, we find that such subtle distinctions across labels may not be effectively captured by generators trained with the standard generation objective in Eq. (1). As shown in Fig. 2, L_disc (defined in Eq. (2)) can even increase during training: it is possible that the dominating patterns in the training samples are label-indiscriminate (e.g., a movie review dataset may frequently mention "the movie"), making the generators of different labels eventually converge to similar distributions, especially when there are limited training samples per label.
To promote the generation of label-discriminative texts, we encourage each token x_j to be more likely generated under the corresponding label y_l than under other labels (i.e., maximize p_{θ_{p_l}}(x_j | x_{<j}) and minimize p_{θ_{p_{l'}}}(x_j | x_{<j}) for l' ≠ l) via a discriminative loss L_disc:

L_{\text{disc}}(\theta_p) = -\frac{1}{n} \sum_{j=1}^{n} L^j_{\text{disc}}(\theta_p), \qquad L^j_{\text{disc}}(\theta_p) = \log \frac{p_{\theta_{p_l}}(x_j \mid x_{<j})}{\sum_{l'=1}^{L} p_{\theta_{p_{l'}}}(x_j \mid x_{<j})}. \tag{2}

Although one can directly combine L_disc with L_gen to train G_{θ_p} to enforce distinction across different labels, doing so results in two undesirable consequences: (1) a hyperparameter needs to be introduced to balance the two losses, whose optimal value is likely to vary by task; and (2) the generation-irrelevant loss L_disc will unavoidably interfere with the language modeling process, making the resulting model prone to generating less fluent and coherent texts.

Algorithm 1: Meta-weighted generator tuning.
Input: few-shot training set D_train; prefix vectors θ_p^(0) initialized with task-descriptive prompts; weighting network parameters ω^(0).
for t ∈ {0, 1, ..., T-1} do
    B ← sample a minibatch from D_train
    θ̂_p^(t)(ω^(t)) ← take one gradient step to descend L_w-gen(θ_p^(t); ω^(t)) on B
    ω^(t+1) ← take one gradient step to descend L_disc(θ̂_p^(t)(ω^(t))) on B
    θ_p^(t+1) ← take one gradient step to descend L_w-gen(θ_p^(t); ω^(t+1)) on B
end for
return θ_p = θ_p^(T)

Weighted Maximum Likelihood Generator Tuning. To preserve the generative learning of G_{θ_p} while emphasizing label-discriminative tokens, we associate each token with a weight in the maximum likelihood loss. Intuitively, when the goal is to generate distinctive texts across different labels as training samples, not all tokens should contribute equally to generator training. For example, for sentiment classification tasks, one would expect "good"/"bad" to be more label-discriminative than "the movie", and the former should receive more attention during training. It is thus natural to generalize L_gen in Eq. (1) to L_w-gen as follows, assuming a weight w_j is given for each token.
\min_{\theta_{p_l}} L_{\text{w-gen}}, \qquad L_{\text{w-gen}}(\theta_{p_l}; w) = \sum_{j=1}^{n} w_j L^j_{\text{gen}}(\theta_{p_l}), \qquad L^j_{\text{gen}}(\theta_{p_l}) = -\log p_{\theta_{p_l}}(x_j \mid x_{<j}). \tag{3}

Note that in L_w-gen, w is treated as a hyperparameter under which θ_{p_l} is optimized. When w_j is the same for every token, Eq. (3) is equivalent to Eq. (1) up to a constant factor. While it is possible to manually design weighting rules for setting w to promote label-discriminative learning, such rules would likely necessitate task-specific knowledge and nontrivial tuning. To facilitate the automatic learning of the weights w, we parameterize them as learnable hyperparameters using the idea of meta-learning.

Meta Weight Learning Setup. To automatically learn the token weights via meta-learning, we formulate a bi-level optimization problem, where the inner objective L_w-gen optimizes the generator parameters θ_p, and the outer objective L_disc optimizes the token weights w that are used as hyperparameters by the inner objective. We parameterize the token weights w via a weighting network g_ω so that w_j = w_j(ω); details about the implementation of g_ω are in Appendix E. Overall, the learning objectives are as follows:

\theta_p^*(\omega) = \arg\min_{\theta_p} L_{\text{w-gen}}(\theta_p; \omega), \qquad L_{\text{w-gen}}(\theta_p; \omega) = \sum_{j=1}^{n} w_j(\omega) L^j_{\text{gen}}(\theta_p), \tag{4}
\omega^* = \arg\min_{\omega} L_{\text{disc}}(\theta_p^*(\omega)), \qquad L_{\text{disc}}(\theta_p^*(\omega)) = -\frac{1}{n} \sum_{j=1}^{n} L^j_{\text{disc}}(\theta_p^*(\omega)).

Under this formulation, the token weights w_j(ω) are automatically learned such that the resulting generator parameters θ_p^*(ω) capture label-discriminative information (i.e., minimize L_disc). Instead of solving for the optimal ω^* and θ_p^* via nested optimization loops, we use an online optimization strategy (Shu et al., 2019) for training efficiency; it also guarantees convergence to the critical points of both L_w-gen and L_disc under mild conditions. The initialization prompts can be found in Appendix C. The overall training procedure is shown in Algorithm 1.

Analysis of Meta Weight Learning.
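The bi-level setup can be made concrete on a toy problem. The sketch below (all sizes, learning rates, and the sigmoid weighting net are illustrative choices, not the paper's implementation) runs the online alternation of Algorithm 1 with numerical gradients: each "generator" is just a categorical distribution over a 4-token vocabulary, and the outer step differentiates the discriminative loss through a one-step lookahead of the inner update:

```python
import numpy as np

def num_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function (toy sizes only)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

L_labels, V = 2, 4
tokens = np.array([0, 1, 2, 3])   # toy "training tokens" x_j
labels = np.array([0, 0, 1, 1])   # label of the sample containing x_j

def L_wgen(theta_flat, omega):
    # Weighted generation loss: sum_j w_j * (-log p_{theta_l}(x_j)),
    # with token weights w_j = sigmoid(omega_j).
    th = theta_flat.reshape(L_labels, V)
    w = 1.0 / (1.0 + np.exp(-omega))
    return sum(-w[j] * np.log(softmax(th[labels[j]])[tokens[j]])
               for j in range(len(tokens)))

def L_disc(theta_flat):
    # Discriminative loss: -mean_j log(p_l(x_j) / sum_l' p_l'(x_j)).
    th = theta_flat.reshape(L_labels, V)
    total = 0.0
    for j in range(len(tokens)):
        p = [softmax(th[l])[tokens[j]] for l in range(L_labels)]
        total -= np.log(p[labels[j]] / sum(p))
    return total / len(tokens)

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(L_labels, V))  # per-label logits
omega = np.zeros(len(tokens))                      # weighting-net parameters
eta = 0.5
d0 = L_disc(theta.ravel())

for t in range(30):
    # Outer step: descend L_disc at the one-step lookahead theta(omega).
    lookahead = lambda o: L_disc(
        theta.ravel() - eta * num_grad(lambda x: L_wgen(x, o), theta.ravel()))
    omega = omega - eta * num_grad(lookahead, omega)
    # Inner step: descend the weighted generation loss under the new weights.
    theta = (theta.ravel()
             - eta * num_grad(lambda x: L_wgen(x, omega), theta.ravel())
             ).reshape(L_labels, V)

print(d0, L_disc(theta.ravel()))  # discriminative loss should drop on this toy
```

In the actual method the lookahead gradient is not computed by finite differences; the analytic form it takes is the subject of the analysis below.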
We analyze the gradient update of the meta weights to study its effect on generator tuning. The weighting network parameter ω is optimized via Eq. (4), and its gradient is as follows (detailed derivation in Appendix A):

-\frac{\partial L_{\text{disc}}(\hat{\theta}_p^{(t)}(\omega))}{\partial \omega}\bigg|_{\omega=\omega^{(t)}} \propto \sum_{j=1}^{n} d_j \frac{\partial w_j(\omega)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}}, \qquad d_j = \frac{\partial L_{\text{disc}}(\hat{\theta}_p)}{\partial \hat{\theta}_p}\bigg|_{\hat{\theta}_p=\hat{\theta}_p^{(t)}} \left( \frac{\partial L^j_{\text{gen}}(\theta_p)}{\partial \theta_p}\bigg|_{\theta_p=\theta_p^{(t)}} \right)^{\top}.

It can be seen that the gradient descent direction of ω is a weighted sum of the token weight gradient ascent directions ∂w_j/∂ω, where the weight d_j characterizes the similarity between the gradient of the discriminative objective and the gradient of the generative objective on the j-th token. The meta weights will therefore be higher on tokens for which optimizing the generative objective is more beneficial for minimizing the discriminative objective.
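The proportionality follows from the one-step online approximation used in Algorithm 1; a sketch of the chain rule, assuming the inner update is a single SGD step on L_w-gen with learning rate η (the full derivation is in Appendix A):

```latex
% One inner SGD step on the weighted generation loss:
\hat{\theta}_p^{(t)}(\omega)
  = \theta_p^{(t)} - \eta \sum_{j=1}^{n} w_j(\omega)\,
    \frac{\partial L^{j}_{\text{gen}}(\theta_p)}{\partial \theta_p}
    \Big|_{\theta_p = \theta_p^{(t)}}
% Chain rule through the lookahead parameters:
\frac{\partial L_{\text{disc}}\big(\hat{\theta}_p^{(t)}(\omega)\big)}{\partial \omega}
  = \frac{\partial L_{\text{disc}}(\hat{\theta}_p)}{\partial \hat{\theta}_p}
    \Big|_{\hat{\theta}_p = \hat{\theta}_p^{(t)}}
    \frac{\partial \hat{\theta}_p^{(t)}(\omega)}{\partial \omega}
  = -\eta \sum_{j=1}^{n} d_j\, \frac{\partial w_j(\omega)}{\partial \omega}
```

Negating and absorbing the positive constant η yields the stated proportionality.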

3.3. CLASSIFIER FINE-TUNING

With the trained generator G_{θ_p}, we can synthesize novel training samples D_gen that augment D_train for fine-tuning a PLM C_ϕ for classification. The major challenge in effectively leveraging D_gen is label noise (i.e., some generated samples may not accurately pertain to the corresponding label), which may deteriorate model performance if standard supervised learning is used directly. We propose a simple noise-robust training procedure to improve the generalization and stability of training: First fine-tune C_ϕ on D_train with standard supervised training, and then continue fine-tuning it on D_gen with temporal ensembling (Laine & Aila, 2017) as regularization. Specifically, given a training sample (x̃, ỹ) ∈ D_gen, we minimize the following classification loss:

\min_{\phi} L_{\text{class}}, \qquad L_{\text{class}}(\phi) = -\log p_{\phi}(\tilde{x})_{\tilde{y}} - \lambda \sum_{l=1}^{L} \tilde{z}_l \log \frac{p_{\phi}(\tilde{x})_l}{\tilde{z}_l}, \tag{5}

where p_ϕ(x̃) is the model prediction on x̃, λ is a regularization weight for temporal ensembling, and z̃ is the accumulated moving average of past model predictions. We also use the ensembled prediction z̃ to filter out noisy synthesized samples: We only include those samples for training where z̃ strongly agrees with the label ỹ (i.e., z̃_ỹ > δ, where δ > 0 is a threshold parameter). In Eq. (5), the first term is the cross-entropy loss on the generated label; the second term corresponds to temporal ensembling, which requires the current model prediction to be close to its past accumulated predictions. This not only neutralizes the fluctuation in model predictions for better training stability when label noise is present (Nguyen et al., 2020), but also helps prevent catastrophic forgetting (Kirkpatrick et al., 2017) of the information previously learned from the few-shot training set D_train. Please refer to Appendix C for details of the temporal ensembling implementation. The overall procedure of classifier fine-tuning is summarized in Algorithm 2.
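The loss in Eq. (5), the moving-average accumulation, and the agreement filter can be sketched as follows; the λ, momentum, and δ values here are illustrative, and the bias-corrected accumulation follows Laine & Aila (2017):

```python
import numpy as np

def class_loss(probs, label, z, lam=1.0):
    """Cross-entropy on the (possibly noisy) generated label plus the
    temporal-ensembling term -lam * sum_l z_l * log(p_l / z_l), a KL term
    pulling the current prediction toward the ensembled predictions z."""
    ce = -np.log(probs[label])
    reg = -np.sum(z * np.log(probs / z))
    return ce + lam * reg

def update_ensemble(z_accum, probs, step, momentum=0.9):
    """Accumulate a moving average of predictions with bias correction."""
    z_accum = momentum * z_accum + (1 - momentum) * probs
    z = z_accum / (1 - momentum ** (step + 1))
    return z_accum, z

def keep_sample(z, label, delta=0.7):
    """Keep a generated sample only when the ensembled prediction strongly
    agrees with its label (z_y > delta)."""
    return z[label] > delta

# One generated sample with label 0, over L = 3 classes.
z_accum = np.zeros(3)
p = np.array([0.8, 0.1, 0.1])                  # current model prediction
z_accum, z = update_ensemble(z_accum, p, step=0)
print(keep_sample(z, 0), class_loss(p, 0, z))
```

When the current prediction matches the ensembled one (z = p), the KL term vanishes and only the cross-entropy remains; predictions that drift from the accumulated average are penalized, which is what stabilizes training under label noise.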

4. EXPERIMENTAL SETUP

Downstream Tasks and Metrics. We conduct evaluation on all tasks of the GLUE benchmark (Wang et al., 2018) (more details in Appendix B) except STS-B, which is a regression task. We fine-tune the classifier C_ϕ via prompt-based fine-tuning (Gao et al., 2021). The only exception is CoLA, for which we use standard fine-tuning since the input data might be out of the distribution of C_ϕ (Gao et al., 2021). Hyperparameter tuning is performed on D_dev. More details are in Appendix C.

Compared Methods. No-augmentation baselines include zero-shot prompting, standard fine-tuning, in-context learning, and the following strong few-shot learning methods: four versions of LM-BFF (Gao et al., 2021), P-Tuning (Liu et al., 2021b) and DART (Zhang et al., 2022). We also compare FewGen with data augmentation methods for few-shot learning: MixText (Chen et al., 2020), back-translation paraphrasing (UDA-style (Xie et al., 2020) augmentation), GPT3Mix (Yoo et al., 2021), and standard fine-tuning of the generator on the few-shot samples with prompts. All augmentation methods use LM-BFF (Man.) for fine-tuning the RoBERTa-Large classifier. More details about the data augmentation baselines can be found in Appendix D.

5.1. MAIN RESULTS

We present the results of FewGen and baselines in Table 1. FewGen achieves overall better performance across the GLUE tasks: on average 5+ points higher than the previous best few-shot method without augmentation, and 3+ points better than the strongest augmentation baseline GPT3Mix, demonstrating the effectiveness of the proposed FewGen method in generating quality training data and leveraging them in combination with the few-shot training set for fine-tuning the classification model.

Comparison with Back Translation. Using back translation to paraphrase the few-shot samples does not improve the results, even with prompt-based fine-tuning to train the classifier; this is probably because it does not produce samples that are sufficiently different from the few-shot training set. The success of UDA (Xie et al., 2020) is grounded in augmentations derived from abundant unlabeled data that improve classifier generalization. However, under the strict few-shot learning setup, there is no access to additional task-specific unlabeled data (Gao et al., 2021), making it challenging for paraphrase-based methods to create sufficiently diverse training samples based only on the small few-shot set. The new training samples produced by our FewGen method are not limited to paraphrases of the few-shot samples, as the generator is trained via prefix-tuning to preserve the PLM's pretraining knowledge, based on which novel training samples can be synthesized.

Comparison with GPT3Mix. The gigantic size of GPT3 makes it challenging to tune on few-shot samples. Therefore, GPT3Mix (Yoo et al., 2021) uses few-shot samples as demonstrations for creating the augmentations. Such an approach suffers from two limitations: (1) Without any parameter update to the PLM, its learning ability is not fully leveraged to adapt to the few-shot training set. (2) The PLM can only use a small subset of the few-shot samples at a time for creating each augmentation, as the number of demonstrations received by the model is bounded by its maximum input sequence length.
This makes the quality of the created augmentations more sensitive to the randomly drawn training samples. Our FewGen method, on the other hand, can use the entire few-shot set for tuning the PLM and achieves overall even better classification results with a much smaller PLM (< 1% the size of the GPT3 model) which can be deployed much more easily in practice.

5.2. ABLATION STUDIES

We further analyze the effectiveness of each important component in FewGen. Specifically, we compare FewGen with the following ablations: (1) using the standard L_gen in Eq. (1) instead of our proposed L_w-gen in Eq. (3) for generator tuning (w. L_gen); (2) using the directly combined L_gen and L_disc for generator tuning (w. L_gen + L_disc); (3) not applying temporal ensembling in Eq. (5) (- temporal ensemble); (4) directly fine-tuning the classification model on the combination of D_gen and D_train (w. fine-tune on D_train ∪ D_gen). As shown in Table 2: (1) & (2) using the standard maximum likelihood loss or the combination of generation and discrimination losses to tune the generator both yield lower-quality training data and lead to degraded classification performance; (3) not applying temporal ensembling for fine-tuning the classifier is more prone to label noise in the generated samples; (4) fine-tuning the classifier on the combination of D_gen and D_train significantly underperforms our two-step fine-tuning method. To study the impact of the amount of generated training samples on model performance, we plot the MNLI-m accuracy (mean and standard deviation) with different sizes of D_gen in Fig. 3. Both the average model performance and stability improve with more generated samples.

5.3. ANALYSES OF LOSS FUNCTIONS FOR GENERATOR TUNING

As shown in Table 2, the choice of generator loss has a significant impact on the synthesized data quality and thus on the final model performance. We conduct further analyses to compare the training processes of the generator under the following three loss functions and the resulting generated samples: (1) L_gen, the standard language modeling loss; (2) L_gen + L_disc, which directly adds the discriminative loss to generator training; and (3) L_w-gen, our meta-weighted objective. Fig. 4 shows the discriminative loss L_disc and the standard language modeling loss on the held-out development set throughout training. Although using L_gen + L_disc helps reduce the discriminative loss, it comes at the cost of hindering language modeling: the generator loss on the development set is high. Using our meta-weighted objective L_w-gen for tuning the generator not only encourages discriminativeness but also mitigates overfitting, yielding the lowest validation set loss. This is probably because the model receives contrastive information from other labels, which facilitates more accurate modeling of the texts with the target label. We present more quantitative analyses of different generator training objectives in Appendix G. We visualize the token weights w automatically learned and used in L_w-gen in Appendix F.

6. DISCUSSIONS AND CONCLUSIONS

Ethical Considerations. Despite the impressive text generation and representation power of PLMs, they can also come with the risk (Bender et al., 2021; Bender & Koller, 2020; Brown et al., 2020) of generating disinformation (Pagnoni et al., 2021) or exacerbating biases (Prabhumoye et al., 2018). Instead of improving upon PLM architectures or generation techniques, our work focuses on using existing PLMs to create training data for NLU tasks. Therefore, our method can be combined with any bias reduction and correction strategies (Gehman et al., 2020; Ma et al., 2020) in practice to reduce the adverse effects of PLMs. Limitations. Compared to few-shot learning methods that directly train classification models on the small training set, FewGen requires tuning a generator PLM and using it to synthesize novel training samples, resulting in higher computation costs and longer running time. Still, we believe that our method may bring more good than harm: when the small training data size becomes the performance bottleneck for NLU tasks, a simple yet costly solution is to obtain more human annotations, and our method may replace or reduce the human effort in such training data creation processes. Conclusions. In this work, we propose FewGen, which leverages few-shot training samples to tune a generator PLM for synthesizing novel training data. The generated data can then be used in combination with the few-shot samples to fine-tune a classification model for better generalization. To emphasize label-discriminative information during generator tuning, we propose a weighted maximum likelihood objective where the token weights are automatically learned via a discriminative meta objective. Since the generated samples may contain label noise, we propose a simple training procedure that first trains the classifier on the few-shot training set and then on the generated set, applying temporal ensembling for noise robustness.
Across seven classification tasks from the GLUE benchmark, FewGen significantly outperforms existing approaches under the same few-shot learning setting. The effectiveness of each important component in FewGen is validated via ablation studies. Future work directions include: using larger PLMs as the generator and the classifier, jointly training both models with each other's high-confidence predictions, and developing systematic metrics for evaluating the quality of generated training samples.

A DERIVATION OF META WEIGHT GRADIENT UPDATE

We first write out the gradient updates of $\hat{\theta}_p^{(t)}\big(\omega^{(t)}\big)$ and $\omega^{(t+1)}$ according to Algorithm 1 as follows:

$$\hat{\theta}_p^{(t)}\big(\omega^{(t)}\big)=\theta_p^{(t)}-\alpha\,\frac{\partial \mathcal{L}_{\text{w-gen}}\big(\theta_p;\omega^{(t)}\big)}{\partial \theta_p}\bigg|_{\theta_p=\theta_p^{(t)}}=\theta_p^{(t)}-\alpha\sum_{j=1}^{n} w_j\big(\omega^{(t)}\big)\,\frac{\partial \mathcal{L}^{j}_{\text{gen}}(\theta_p)}{\partial \theta_p}\bigg|_{\theta_p=\theta_p^{(t)}}, \tag{6}$$

$$\omega^{(t+1)}=\omega^{(t)}-\beta\,\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p^{(t)}(\omega)\big)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}}, \tag{7}$$

where $\alpha$ and $\beta$ are step sizes. The gradient in Eq. (7) is calculated as:

$$\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p^{(t)}(\omega)\big)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}}
=\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p\big)}{\partial \hat{\theta}_p}\bigg|_{\hat{\theta}_p=\hat{\theta}_p^{(t)}}\frac{\partial \hat{\theta}_p(\omega)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}}
=\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p\big)}{\partial \hat{\theta}_p}\bigg|_{\hat{\theta}_p=\hat{\theta}_p^{(t)}}\Bigg(-\alpha\sum_{j=1}^{n}\frac{\partial \mathcal{L}^{j}_{\text{gen}}(\theta_p)}{\partial \theta_p}^{\!\top}\bigg|_{\theta_p=\theta_p^{(t)}}\frac{\partial w_j(\omega)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}}\Bigg) \quad \text{(plugging in Eq. (6))}$$

$$=-\alpha\sum_{j=1}^{n}\underbrace{\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p\big)}{\partial \hat{\theta}_p}\bigg|_{\hat{\theta}_p=\hat{\theta}_p^{(t)}}\frac{\partial \mathcal{L}^{j}_{\text{gen}}(\theta_p)}{\partial \theta_p}^{\!\top}\bigg|_{\theta_p=\theta_p^{(t)}}}_{\triangleq\, d_j}\,\frac{\partial w_j(\omega)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}}.$$

Therefore,

$$-\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p^{(t)}(\omega)\big)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}} \propto \sum_{j=1}^{n} d_j\,\frac{\partial w_j(\omega)}{\partial \omega}\bigg|_{\omega=\omega^{(t)}},\qquad d_j=\frac{\partial \mathcal{L}_{\text{disc}}\big(\hat{\theta}_p\big)}{\partial \hat{\theta}_p}\bigg|_{\hat{\theta}_p=\hat{\theta}_p^{(t)}}\frac{\partial \mathcal{L}^{j}_{\text{gen}}(\theta_p)}{\partial \theta_p}^{\!\top}\bigg|_{\theta_p=\theta_p^{(t)}}.$$
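As a sanity check on the chain rule used in this derivation, the toy sketch below (our own construction; the scalar parameter, quadratic per-token losses, and one-dimensional weighting net are illustrative assumptions, not the paper's setup) implements the virtual parameter step of Eq. (6) and the analytic gradient of Eq. (7) with d_j = (dL_disc/dθ̂)·(dL_gen^j/dθ), then verifies the analytic gradient against a central finite difference.

```python
import numpy as np

alpha, theta, target = 0.1, 2.0, 0.5
a = np.array([1.0, 3.0, 0.5])      # curvature of each toy per-token loss L_gen^j
c = np.array([0.2, -0.4, 0.1])     # features feeding the toy weighting net

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def w(omega):                       # w_j(omega): softmax-normalized token weights
    return softmax(omega * c)

def disc_after_step(omega):
    """L_disc(theta~(omega)) with theta~ = theta - alpha * sum_j w_j dL_gen^j/dtheta (Eq. 6)."""
    theta_t = theta - alpha * np.dot(w(omega), a * theta)
    return 0.5 * (theta_t - target) ** 2

def analytic_grad(omega):
    """Eq. (7): dL_disc/domega = -alpha * sum_j d_j * dw_j/domega,
    where d_j = (dL_disc/dtheta~) * (dL_gen^j/dtheta)."""
    wj = w(omega)
    theta_t = theta - alpha * np.dot(wj, a * theta)
    d = (theta_t - target) * (a * theta)            # d_j for each token
    dw_domega = wj * (c - np.dot(wj, c))            # softmax Jacobian applied to c
    return -alpha * np.dot(d, dw_domega)

omega0, eps = 0.3, 1e-6
fd = (disc_after_step(omega0 + eps) - disc_after_step(omega0 - eps)) / (2 * eps)
assert abs(fd - analytic_grad(omega0)) < 1e-6      # chain-rule identity holds
```

The sign structure matches the derivation: tokens whose generative gradient aligns with the discriminative gradient (large positive d_j) have their weights pushed up.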

B GLUE TASKS

We provide details of the seven classification tasks included in the GLUE benchmark. MNLI: Multi-genre Natural Language Inference (Williams et al., 2018) requires predicting whether a given premise sentence entails, contradicts, or is neutral with respect to a given hypothesis sentence. QQP: Quora Question Pairs (Shankar et al., 2017) requires judging whether a pair of questions are semantically equivalent. QNLI: Question Natural Language Inference requires predicting whether a given sentence contains the answer to a given question sentence. SST-2: Stanford Sentiment Treebank (Socher et al., 2013) requires determining whether a movie review has positive or negative sentiment. CoLA: Corpus of Linguistic Acceptability (Warstadt et al., 2019) requires determining whether a given sentence is linguistically acceptable. RTE: Recognizing Textual Entailment (Bentivogli et al., 2009; Dagan et al., 2005; Giampiccolo et al., 2007; Haim et al., 2006) requires predicting whether a given premise sentence entails a given hypothesis sentence. MRPC: Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) requires determining whether two sentences are paraphrases of each other. Details of Initialization Prompts Used for Generator Tuning on Different Tasks. For generator tuning, we find it beneficial to initialize the prefix vectors with task-descriptive prompts, similar to the observations in Li & Liang (2021). The prefix lengths (i.e., numbers of trained prefix token positions) equal the numbers of tokens in the prompts. We present the prompts used for initializing the prefix vectors for different tasks in Table 3. For sequence-pair tasks, an additional infix prompt is used between the two sequences, and we also tune the embeddings of the infix (i.e., prompt-tuning (Lester et al., 2021)) for generator training.

C IMPLEMENTATION DETAILS

Details of Generator Tuning. The meta-weighted generator tuning procedure (Algorithm 1) involves three forward and backward passes per step, so its time complexity is approximately 3 times that of standard generator training without meta learning. However, since the few-shot training sets contain only a small amount of data, the extra time cost is usually affordable. In practice, our generator tuning with meta weight learning takes 10 minutes per task (standard generator training without meta learning takes 3.5 minutes). We use a fixed set of hyperparameters for all tasks without task-specific hyperparameter tuning: batch size 2, learning rate 5e-3 for optimizing θ p, learning rate 1e-2 for optimizing ω, and 20 training epochs.

Details of Generating Training Data. Following Meng et al. (2022), for sequence-pair tasks (MNLI, QQP, QNLI, RTE and MRPC), we randomly sample the first sequence from the pretraining corpus (e.g., Wikipedia) and use greedy sampling to generate the second sequence. For single-sequence tasks (SST-2 and CoLA), we use top-k sampling with temperature (k = 10) to generate training data from scratch. For all tasks, we generate 5,000 samples per label. For SST-2, we use one of the following tokens to start generation: "a", "one", "the", "this", "that", "i", "you", "it", "what". For CoLA, we use a random stop word to start generation. Hyperparameters for Fine-Tuning Classifier PLMs. For fine-tuning on the few-shot training samples D train, we search among the following hyperparameter ranges based on development set (D dev) model performance and pick the best-performing model for further fine-tuning on synthesized data: learning rate in [1e-5, 2e-5] and batch size in [2, 4, 8]. The number of training steps is fixed to 1,000. We also find it beneficial to apply label smoothing (smoothing weight 0.15) when fine-tuning on the few-shot training set. For fine-tuning on the synthesized training samples D gen, we use the following hyperparameters: learning rate 5e-6; batch size 16; label smoothing weight ϵ = 0.15; temporal ensemble momentum γ = 0.9; temporal ensemble loss weight λ = 20; training steps T = 6,000. Details of Temporal Ensembling for Fine-Tuning Classifier PLMs on Synthetic Data. We update the ensembled predictions z as follows, where p ϕ is the current model prediction, γ is the momentum parameter, ẑ is the accumulated model prediction before bias correction, z is the accumulated model prediction after bias correction, and t is the number of updates z has received: ẑ ← γ ẑ + (1 - γ) p ϕ, z ← ẑ / (1 - γ^t). The accumulated model prediction ẑ is zero-initialized; the division by (1 - γ^t) performs bias correction (Laine & Aila, 2017).
After each update of z, it is compared to a threshold value δ: each synthesized sample (x̃, ỹ) is included in training only if z_ỹ > δ. We update the ensembled predictions z on all samples in D gen every 200 steps, and set the sample-filtering threshold to δ = 0.8.
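The update-and-filter procedure above can be sketched as follows (a minimal illustration in numpy; the class name and the small probability arrays are our own, not the paper's code): an exponential moving average of predictions with bias correction, followed by thresholding the ensembled probability of each sample's pseudo label.

```python
import numpy as np

class TemporalEnsemble:
    """EMA of model predictions with bias correction (Laine & Aila, 2017)."""
    def __init__(self, num_samples, num_labels, gamma=0.9, delta=0.8):
        self.z_hat = np.zeros((num_samples, num_labels))  # accumulated, pre-correction
        self.gamma, self.delta, self.t = gamma, delta, 0

    def update(self, probs):
        """probs: current model predictions p_phi over all generated samples."""
        self.t += 1
        self.z_hat = self.gamma * self.z_hat + (1 - self.gamma) * probs
        return self.z_hat / (1 - self.gamma ** self.t)    # bias-corrected z

    def keep_mask(self, z, pseudo_labels):
        """Keep sample i for training only if z[i, y_i] > delta."""
        return z[np.arange(len(pseudo_labels)), pseudo_labels] > self.delta

ens = TemporalEnsemble(num_samples=3, num_labels=2, gamma=0.9, delta=0.8)
p = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
z = ens.update(p)          # after bias correction, z equals p on the first update
mask = ens.keep_mask(z, np.array([0, 0, 1]))  # only the first sample clears delta=0.8
```

Because of the zero initialization, the bias correction makes z an unbiased average of past predictions from the very first update, which matters early in fine-tuning when γ^t is still large.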

Computation Environment.

The experiments are conducted on NVIDIA A100 GPUs.

D DATA AUGMENTATION BASELINE DETAILS

Details About MixText (Chen et al., 2020). We use the TMix version of MixText to perform data interpolation on the few-shot labeled dataset (since there is no access to unlabeled task-specific data under the strict few-shot learning setting (Gao et al., 2021)). We adapt the label mix-up operation to fit prompt-based fine-tuning by interpolating the label words instead of categorical labels; we observe that this yields better few-shot performance than the original TMix, probably analogous to why prompt-based fine-tuning outperforms standard fine-tuning for few-shot learning. We train the classifier with the supervised loss combined with the consistency loss over interpolated samples, as in the original paper, and follow the default hyperparameters in MixText. Details About GPT3Mix (Yoo et al., 2021). We create 5,000 augmented samples per label to make the resulting training set size equal to that of FewGen. After obtaining the augmented examples and their pseudo labels (the probability predictions over all labels by GPT3), we use them along with the real few-shot samples to fine-tune the classifier, following the setting in GPT3Mix. Details About Standard Generator Fine-Tuning. We fine-tune the same 1.6B CTRL (Keskar et al., 2019) model as used in FewGen with the standard maximum likelihood objective. Unlike previous studies (Anaby-Tavor et al., 2020; Kumar et al., 2020) that prepend categorical labels to the training samples, we enhance generator fine-tuning with the label-descriptive prompts (shown in Table 3) used in FewGen. We create 5,000 augmented samples per label to make the resulting training set size equal to that of FewGen.

E DETAILS OF WEIGHTING NETWORK IMPLEMENTATION

Table 4: Prompts used for GPT3Mix augmentation. For sequence-pair tasks, x 1 and x 2 denote the first and second input sequence, respectively. For single-sequence tasks, x denotes the input sequence. y denotes the label name. Only one example is shown in the template for clarity; in practice, we concatenate k = 4 samples according to the optimal setting in GPT3Mix (Yoo et al., 2021).

SST-2. Template: "Each item in the following list contains a movie review and the respective sentiment. The sentiment is one of 'positive' or 'negative'. Movie review: x (Sentiment: y) . . ." Label names: positive: positive; negative: negative.

CoLA. Template: "Each item in the following list contains a text and the respective grammar. The grammar is one of 'correct' or 'incorrect'. Text: x (Grammar: y) . . ." Label names: grammatical: correct; not grammatical: incorrect.

Table 5: Comparison of weighting network architectures. The default architecture is a feedforward network (FFN) with one hidden layer. We also explore adding a self-attention layer on top of the generator PLM's output hidden states (Self-attention). We use the same two metrics as in Appendix G.

Since the token weights w used in Eq. (4) need to characterize the discriminativeness of each token, we use the value of the discriminative objective at each token, L disc^j, as the input to the weighting network, and we use softmax to normalize the weights:


$$w_j(\omega) = \frac{\exp\big(g_\omega(\mathcal{L}_{\text{disc}}^{j})\big)}{\sum_{j'=1}^{n} \exp\big(g_\omega(\mathcal{L}_{\text{disc}}^{j'})\big)}.$$

Following Shu et al. (2019), we instantiate g ω as a feedforward network (FFN) with a single 100-dimensional hidden layer by default. We also explore an alternative instantiation that adds one self-attention layer on top of the generator PLM's output hidden states; the meta weights are then obtained by projecting the outputs of the self-attention layer with another linear layer. We evaluate the resulting generator quality via the same two metrics as in Appendix G. Table 5 shows that using more complicated architectures (e.g., adding a self-attention layer) does not yield a better generator than a simple FFN for meta weight learning. This is probably because the generator PLM's output representations are already sufficiently contextualized and contain the information necessary for learning the token weights, so a simple FFN suffices as the weighting network. More complicated networks, on the other hand, introduce more randomly initialized parameters that may not be learned well from the limited amount of few-shot training data.
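The default instantiation can be sketched as below (a minimal numpy illustration under our own naming; FewGen's actual network is trained jointly with the generator rather than randomly evaluated): a one-hidden-layer FFN g_ω maps each per-token discriminative loss to a scalar score, and a softmax over the sequence yields the token weights.

```python
import numpy as np

rng = np.random.default_rng(0)

class WeightingFFN:
    """One-hidden-layer FFN g_omega mapping per-token L_disc^j values to scores,
    normalized over the sequence with a softmax (default hidden size 100)."""
    def __init__(self, hidden=100):
        self.W1 = rng.normal(scale=0.1, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def token_weights(self, l_disc):
        """l_disc: (n,) per-token discriminative losses -> (n,) weights summing to 1."""
        h = np.maximum(l_disc[:, None] @ self.W1 + self.b1, 0.0)  # ReLU hidden layer
        scores = (h @ self.W2 + self.b2).ravel()                  # g_omega(L_disc^j)
        e = np.exp(scores - scores.max())                         # stable softmax
        return e / e.sum()

net = WeightingFFN()
w = net.token_weights(np.array([0.1, 2.0, 0.5, 0.05]))
assert np.isclose(w.sum(), 1.0) and (w > 0).all()
```

The softmax normalization guarantees a valid weight distribution over the sequence regardless of the FFN's parameters, which is what makes the weighted objective comparable in scale to the standard loss.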

F VISUALIZATION OF TOKEN WEIGHT LEARNING

To gain an intuitive understanding of which tokens are assigned higher weights during generator tuning, we visualize the learned weights in Fig. 5. The tokens with higher weights (e.g., "weak" in the first example and "hates" in the second example) are learned to be important tokens that decide the relation of the second sentence to the first sentence (i.e., the label of the training sample). With such tokens emphasized during training, the generator is encouraged to capture label-discriminative information that facilitates the generation of unambiguous training samples. Sentence 1: But prophecy is always strongest when based on coincidence--that is a prime rule. Sentence 2: Prophecies based on coincidences are widely known to be weak and unreliable.

Label: Contradiction

Sentence 1: But Rodgers did tell Lewis that he despises Amelio because Amelio supported Clinton, so it is Rodgers' mistake, not our author's, that we are correcting.

[Figure 5: visualization of the learned token weights over the example sentences above.]

Table 7: MNLI generation results under different generator training objectives. Each row shows the label the sample was generated for, the generated second sentence, and (in parentheses) its true relation to the given first sentence.

L gen
entailment: The NIMD found that there was no evidence of a link between exposure to high levels of mercury and thyroid cancer. (neutral)
contradiction: There was no evidence of mercury in hair samples from Taiji. (neutral)

L gen + L disc
entailment: The number of hairs in a sample is equal to the number of people who lived in Taiji. (neutral)
neutral: The results showed that there was no significant difference in levels of mercury. (neutral)
contradiction: Hair samples from 1,137 Taiji residents were not tested. (contradiction)

L w-gen
entailment: The NIMD tested hair samples from 1,137 residents of Taiji. (entailment)
neutral: There was no significant difference in levels between people who lived near a nickel mine and those living far away. (neutral)
contradiction: The NIMD did not test any of the hair samples. (contradiction)

G QUANTITATIVE ANALYSES OF GENERATOR TRAINING OBJECTIVES

Apart from the final classification model performance, which indirectly reflects the synthetic data quality, we additionally conduct more direct quantitative analyses of the different generator training objectives. We use two metrics: (1) the accuracy of generated texts, as judged by fully-supervised RoBERTa Large models fine-tuned on the original training sets of each task. We adopt such automatic evaluation instead of human evaluation because it is efficient and reliable: fully-supervised RoBERTa Large models have comparable or better accuracy than human baselines according to the GLUE benchmark. (2) The generator's perplexity on the test sets, which reflects how well the generator models the task distribution. As shown in Table 6, using L w-gen for generator training consistently outperforms using L gen or L gen + L disc, both in generated text accuracy and in language modeling ability. Comparing L w-gen with L gen, the automatically learned meta weights emphasize discriminative tokens in generator training and help the generator capture the subtle semantic differences across labels, resulting in better language modeling quality and more distinctive generated data. Comparing L w-gen with L gen + L disc, the generator training objective is not directly impacted by the discriminative objective, thus avoiding the gradient interference issue in multi-task learning (Standley et al., 2019): the gradient for optimizing the generative probability p(x|y l) would be interfered with by the gradient for optimizing the discriminative probability p(y l |x) if L gen + L disc were used. Therefore, using L w-gen results in better language modeling quality and more fluent and coherent generation results. We also showcase concrete generation results for the three labels of MNLI by models trained with the three different loss functions in Table 7.
The model trained with L gen produces fluent and coherent sentences, but they do not accurately pertain to the desired label (i.e., the "entailment" and "contradiction" generations are in fact neutral with respect to the given sentence), lacking label discriminativeness. When L gen + L disc is used, the generated samples of different labels are more distinctive, but they also become less natural and coherent because the model's language modeling ability is hampered. The generator tuned with L w-gen produces samples that are both coherent and label-discriminative, which can serve as quality training data.

Table 8 content (16-shot training samples of SST-2):

Positive:
the film jolts the laughs from the audience -- as if by cattle prod .
#3 the film presents visceral and dangerously honest revelations about the men and machines behind the curtains of our planet .
#4 a film that will enthrall the whole family .
#5 serious movie-goers embarking upon this journey will find that the road to perdition leads to a satisfying destination .
#6 sweet and memorable film .
#7 shyamalan takes a potentially trite and overused concept (aliens come to earth) and infuses it into a rustic , realistic , and altogether creepy tale of hidden invasion .
#8 a crisp psychological drama (and) a fascinating little thriller that would have been perfect for an old " twilight zone " episode .
#9 my big fat greek wedding is not only the best date movie of the year , it 's also a -- dare i say it twice -- delightfully charming -- and totally american , i might add -- slice of comedic bliss .
#10 a comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .
#11 diggs and lathan are among the chief reasons brown sugar is such a sweet and sexy film .
#12 you 're not merely watching history , you 're engulfed by it .
#13 the concept is a hoot .
#14 the filmmakers ' eye for detail and the high standards of performance convey a strong sense of the girls ' environment .
#15 a haunting tale of murder and mayhem .
#16 neil burger here succeeded in ... making the mystery of four decades back the springboard for a more immediate mystery in the present .

Negative:
#1 nothing happens , and it happens to flat characters .
#2 as lively an account as seinfeld is deadpan .
#3 so we got ten little indians meets friday the 13th by way of clean and sober , filmed on the set of carpenter 's the thing and loaded with actors you 're most likely to find on the next inevitable incarnation of the love boat .
#4 the plot is nothing but boilerplate cliches from start to finish , and the script assumes that not only would subtlety be lost on the target audience , but that it 's also too stupid to realize that they 've already seen this exact same movie a hundred times
#5 ultimately , sarah 's dedication to finding her husband seems more psychotic than romantic , and nothing in the movie makes a convincing case that one woman 's broken heart outweighs all the loss we witness .
#6 the big finish is a bit like getting all excited about a chocolate eclair and then biting into it and finding the filling missing .
#7 this picture is mostly a lump of run-of-the-mill profanity sprinkled with a few remarks so geared toward engendering audience sympathy that you might think he was running for office -- or trying to win over a probation officer .
#8 just because a walk to remember is shrewd enough to activate girlish tear ducts does n't mean it 's good enough for our girls .
#9 often lingers just as long on the irrelevant as on the engaging , which gradually turns what time is it there ?
#10 this movie , a certain scene in particular , brought me uncomfortably close to losing my lunch .
#11 but it would be better to wait for the video .
#12 a rude black comedy about the catalytic effect a holy fool has upon those around him in the cutthroat world of children 's television .
#13 just a collection of this and that -- whatever fills time -- with no unified whole .
#14 although god is great addresses interesting matters of identity and heritage , it 's hard to shake the feeling that it was intended to be a different kind of film .
#15 the chocolate factory without charlie .
#16 in that setting , their struggle is simply too ludicrous and borderline insulting .

H CONCRETE GENERATION RESULTS

We present the few-shot training samples (D train) of SST-2 in Table 8, and some concrete generation results (from D gen) for all tasks in Tables 9, 10, 11, 12, 13, 14, and 15. As shown in Tables 8 and 9, the synthetic samples are accurate and sufficiently different from the given training samples to serve as effective augmentations. For sequence-pair tasks, because we randomly sample the first sequence from the pretraining corpus and let the generator create the second sequence conditioned on a given label, the generated samples are necessarily different from the given training samples.

Table 9: Examples of concrete generations for SST-2.
#1 positive: i was a bit skeptical when i first saw the trailer for this movie, but after watching it, it was quite enjoyable.
   negative: i was expecting a fun and entertaining movie with some nice special effects, but instead it was just plain boring.
#2 positive: that is, a film that has the potential to move an audience to tears, or at least to a state of mind where you feel something.
   negative: a film that has the potential to be one of the most moving and powerful experiences ever made is instead marred by an unfocused script, bad acting and a lack of emotional depth.
#3 positive: this is a good movie -- it's not just for people who like to see things blow up or have explosions or shoot guns.
   negative: i am a big fan of the first two movies but this one is just plain bad.

Examples of concrete generations for CoLA:
#1 not grammatical: A few weeks later we saw them on television. We thought there were a hundred more. It wasn't that they came out one at time.
#2 grammatical: Doing the dishes is a task that I have to do because I am not allowed to go out and play with my friends.
   not grammatical: Only those whose lives depend on it, but only when you are at its mercy do it harm.
#3 grammatical: To be able to speak the language of the other person is a prerequisite for communication.
   not grammatical: The only good way to avoid getting into trouble for being too friendly would to let me go on a trip.

Examples of concrete generations for RTE:
entailment:
Sentence 1: It embodies the aesthetic ideals of the Queen Anne style through its design and detailing and shows the type of house that attracted prosperous Napans at the end of the Victorian era.
Sentence 2: It is a fine example of the Queen Anne style of architecture through its design and detailing and shows the type of house that attracted prosperous Napans at the end of the Victorian era.
not entailment:
Sentence 1: It embodies the aesthetic ideals of the Queen Anne style through its design and detailing and shows the type of house that attracted prosperous Napans at the end of the Victorian era.
Sentence 2: The building is a fine example in this style, with an elegant facade reminiscent to those found on many grand mansions built by wealthy merchants during America's Gilded Age.



Footnotes:
1. Code is shared in the supplementary material.
2. The CoLA results reported in the original GPT3Mix paper use accuracy as the metric instead of Matthews correlation; our reimplemented GPT3Mix achieves 79.4 (0.6) on CoLA if measured by accuracy.
3. For this ablation, we upsample D train by 100x so that its size is comparable with D gen; without upsampling, the result is much worse.
4. https://gluebenchmark.com/leaderboard



Figure 1: Overview of FewGen. A generator PLM is first tuned on the few-shot samples with our proposed meta-weighted maximum likelihood objective and then used to synthesize new training samples. A classification PLM is finally trained on both the few-shot and the generated samples.

Figure 2: (On MNLI) Training the generator only via L gen does not automatically decrease L disc .

Meta-Weighted Generator Tuning.
Input: D train: Few-shot training set.
Parameter: T: Number of training steps.
Output: θ p: Prefix parameters for all labels.
Initialize θ
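The bi-level structure behind the (truncated) pseudocode above can be seen in a deliberately tiny scalar example: take a virtual gradient step on the weighted generative loss, evaluate the discriminative loss at the resulting parameters, and backpropagate through the virtual step to update the weight. The quadratic losses and all names below are hypothetical stand-ins; the actual method uses per-token weights produced by a weighting network.

```python
def meta_weight_step(theta, w, lr_theta=0.1, lr_w=0.5):
    """One bi-level update on a scalar toy problem.

    Toy stand-ins (not the paper's losses):
      L_gen(theta)  = (theta - 2)^2   # generative objective
      L_disc(theta) = (theta - 1)^2   # discriminative meta objective
    """
    grad_gen = 2 * (theta - 2)                        # dL_gen/dtheta
    theta_virtual = theta - lr_theta * w * grad_gen   # virtual inner step
    # Chain rule through the virtual step:
    #   dL_disc(theta')/dw = L_disc'(theta') * dtheta'/dw
    grad_disc = 2 * (theta_virtual - 1)
    grad_w = grad_disc * (-lr_theta * grad_gen)
    w_new = w - lr_w * grad_w                         # meta update of weight
    theta_new = theta - lr_theta * w_new * grad_gen   # real parameter update
    return theta_new, w_new

theta_new, w_new = meta_weight_step(theta=0.0, w=1.0)
```

In this toy run the virtual step undershoots the discriminative optimum, so the meta gradient increases the weight, mirroring how tokens whose modeling helps the discriminative objective receive larger weights.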

Classification model fine-tuning on D train and D gen.
Input: D train: Few-shot training set; D gen: Synthesized training set.
Parameter: T: Number of training steps.
Output: ϕ: Trained classification model parameters.
ϕ(0) ← Train on D train with standard supervised learning
z ← 0 // Ensembled prediction initialization
for t ∈ [0, 1, . . . , T − 1] do
    B ← Sample a minibatch from D gen
    ϕ(t+1) ← Take one gradient step to descend L class in Eq. (5) on B
    z ← Accumulate the current model prediction
    Update D gen to exclude noisy samples based on z
end
return ϕ = ϕ(T)
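The noisy-sample filtering step above can be sketched in isolation (a hypothetical pure-Python helper; the real method accumulates predictions while the classifier itself is being trained, rather than with a fixed model as here):

```python
def filter_noisy(d_gen, predict, T=5, momentum=0.9, threshold=0.8):
    """Temporal-ensembling-style filter over synthetic samples.

    d_gen:   list of (x, y) synthetic samples.
    predict: callable returning a {label: prob} dict for an input x.
    Accumulates an exponential moving average of the model's
    confidence on each sample's own label and drops samples whose
    ensembled confidence stays low.
    """
    z = [0.0] * len(d_gen)            # ensembled confidence per sample
    for _ in range(T):
        for i, (x, y) in enumerate(d_gen):
            z[i] = momentum * z[i] + (1 - momentum) * predict(x)[y]
    correction = 1 - momentum ** T    # EMA bias correction
    return [s for s, zi in zip(d_gen, z) if zi / correction >= threshold]

# Toy usage: a fixed "model" that trusts one sample and not the other.
probs = {"good": {"pos": 0.95, "neg": 0.05},
         "bad":  {"pos": 0.10, "neg": 0.90}}
kept = filter_noisy([("good", "pos"), ("bad", "pos")], lambda x: probs[x])
```

Averaging predictions over training steps makes the filter robust to a single bad checkpoint, which is the motivation for ensembling z rather than using the latest prediction alone.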

Figure 3: MNLI-m accuracy with different amounts of generated training data.

Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005)  requires predicting whether two sentences are semantically equivalent or not.

MNLI:
Prompt: Each item in the following list contains a premise, a hypothesis and their logical relation. The logical relation is one of 'entailment', 'neutral' or 'contradiction'. Premise: x1 Hypothesis: x2 (Logical relation: y) . . .
Label mapping: entailment → entailment; neutral → neutral; contradiction → contradiction.

QNLI:
Prompt: Each item in the following list contains a question, an answer and their logical relation. The logical relation is one of 'entailment' or 'neutral'. Question: x1 Answer: x2 (Logical relation: y) . . .
Label mapping: entailment → entailment; not entailment → neutral.

RTE:
Prompt: Each item in the following list contains a premise, a hypothesis and their logical relation. The logical relation is one of 'entailment' or 'neutral'. Premise: x1 Hypothesis: x2 (Logical relation: y) . . .
Label mapping: entailment → entailment; not entailment → neutral.

MRPC:
Prompt: Each item in the following list contains two sentences and their semantic relation. The semantic relation is one of 'equivalent' or 'different'. Sentence 1: x1 Sentence 2: x2 (Semantic relation: y) . . .
Label mapping: equivalent → equivalent; not equivalent → different.

QQP:
Prompt: Each item in the following list contains two questions and their semantic relation. The semantic relation is one of 'equivalent' or 'different'. Question 1: x1 Question 2: x2 (Semantic relation: y) . . .
Label mapping: equivalent → equivalent; not equivalent → different.

Figure 5: Visualization of learned token weights on two samples from MNLI's few-shot training set. The generator is trained given the first sentence to generate the second. The tokens associated with higher weights are more label indicative.


#1 entailment:
Sentence 1: In 1970, the Science Council of Canada recommended that the government of Canada immediately invest in industrial development of the aviation industry, including construction of aircraft, navigation aids, and regulation of air traffic.
Sentence 2: The government of Canada has invested in the aviation industry.
#1 not entailment:
Sentence 1: In 1970, the Science Council of Canada recommended that the government of Canada immediately invest in industrial development of the aviation industry, including construction of aircraft, navigation aids, and regulation of air traffic.
Sentence 2: The Aviation Industry was established by a Royal Decree on June 1, 1970.
#2 entailment:
Sentence 1: All of the Centre's staff are fluently bilingual in both English and Chinese and are familiar with the traditions of the Chinese culture.
Sentence 2: The Centre is a bilingual institution.
#2 not entailment:
Sentence 1: All of the Centre's staff are fluently bilingual in both English and Chinese and are familiar with the traditions of the Chinese culture.
Sentence 2: The Centre is a cultural centre for learning about China.
#3 entailment:
Sentence 1: Crosbie ran unsuccessfully for the leadership of the Liberal Party of Newfoundland and Labrador in 1969, losing to Smallwood, and was also a candidate in the Progressive Conservative Party of Canada's 1983 leadership election, placing third.
Sentence 2: Crosbie was a candidate in the Progressive Conservative Party of Canada's 1983 leadership election, placing third.
#3 not entailment:
Sentence 1: Crosbie ran unsuccessfully for the leadership of the Liberal Party of Newfoundland and Labrador in 1969, losing to Smallwood, and was also a candidate in the Progressive Conservative Party of Canada's 1983 leadership election, placing third.
Sentence 2: He lost his bid as leader after he failed twice at running against John Diefenbaker.

train, and then use it as a generator G θ to synthesize a large amount of novel samples D gen = {(x̃, ỹ)i} that augment the original training set. Finally, a classification PLM C ϕ is fine-tuned on both D train and D gen to perform the task. An overview of our FewGen method is shown in Fig. 1.

model. After training, G θ can be used to generate novel texts by iteratively sampling tokens from its generation probability distribution.
Prefix-Tuning. Unlike fine-tuning, which updates all model parameters θ of a PLM, prefix-tuning (Li & Liang, 2021) freezes all pretrained Transformer parameters and only optimizes prefix vectors θ p that are prepended to each Transformer layer. We use prefix-tuning for training G θp on D train because (1) it offers better effectiveness than fine-tuning for small datasets (Li & Liang, 2021) and (
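The mechanism can be sketched with toy single-query attention in plain Python (hypothetical two-dimensional vectors; a real implementation prepends learned key/value prefixes at every Transformer layer and head): the prefix vectors participate in attention like ordinary context, but they would be the only parameters receiving gradients.

```python
import math

def attention_with_prefix(q, keys, values, prefix_kv):
    """Single-query attention where trainable prefix key/value pairs
    are prepended to the sequence's own (frozen) keys/values."""
    pk, pv = prefix_kv
    all_k = pk + keys                    # prefix keys come first
    all_v = pv + values
    # Dot-product scores followed by a numerically stable softmax.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in all_k]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Attention output: probability-weighted sum of values.
    dim = len(all_v[0])
    return [sum(p * v[d] for p, v in zip(probs, all_v)) for d in range(dim)]

out = attention_with_prefix(
    q=[1.0, 0.0],
    keys=[[1.0, 0.0]], values=[[0.0, 1.0]],      # frozen context
    prefix_kv=([[1.0, 0.0]], [[1.0, 0.0]]),      # trainable prefix
)
```

Because the prefix key matches the query as well as the real key does here, the output blends the prefix value and the context value equally, showing how prefixes steer attention without touching the frozen weights.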

Results on seven classification tasks of the GLUE benchmark. We report average and standard deviation (as subscripts) performance over 5 different D train /D dev splits defined in Gao et al. (2021). †: Results from Gao et al. (2021). ‡: Results from Zhang et al. (2022). Methods that use additional models apart from the final classification model are marked.

Ablation studies by removing (-) or switching (w.) one component of FewGen. Columns (left to right): MNLI (m/mm), QQP, QNLI, SST-2, CoLA, RTE, MRPC; standard deviations in parentheses.

FewGen: /77.1 (1.0), 71.5 (1.7), 76.3 (4.4), 93.1 (0.8), 40.0 (7.5), 71.2 (2.4), 81.1 (2.5)
w. L gen: 74.9 (1.0)/76.2 (1.0), 70.7 (1.9), 75.0 (4.8), 92.5 (0.7), 37.8 (8.2), 69.5 (2.2), 80.8 (3.0)
w. L gen + L disc: 74.6 (1.6)/76.0 (1.5), 68.8 (2.1), 76.1 (4.3), 92.4 (0.8), 41.2 (9.0), 70.1 (2.2), 79.6 (2.4)
- temporal ensemble: 72.2 (2.5)/74.0 (2.2), 65.8 (2.1), 75.1 (2.7), 92.1 (1.7), 33.9 (4.4), 66.6 (2.4), 80.4 (3.2)
w. fine-tune on D train ∪ D gen: 68.9 (1.8)/70.6 (1.9), 64.3 (1.5), 71.1 (4.1), 91.8 (1.3), 34.0 (3.2), 59.6 (1.0), 80.4 (3.5)

Prompts used for initializing the prefix vectors and control codes (required by CTRL (Keskar et al., 2019)) used in generator training. The control codes are selected to approximate the domain of the task. For single-sequence tasks, x denotes the training sample; for sequence-pair tasks, x1 and x2 denote the first and second sequence in the training sample, respectively.

We use the 175B GPT-3 model for generating the augmentations. For creating each augmentation, we randomly sample k = 4 examples (the optimal setting according to GPT3Mix) from the few-shot training set as demonstrations. The prompts follow the suggested format proposed in the original paper (Yoo et al., 2021) and are shown in Table
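The demonstration-sampling step can be sketched as follows. The template below is illustrative only (the actual prompt format is the one proposed in Yoo et al. (2021)); `build_gpt3mix_prompt` and its wording are hypothetical, meant to show the mechanics of assembling k sampled demonstrations into a prompt for the generator to continue:

```python
import random

def build_gpt3mix_prompt(train_set, k=4, seed=0):
    """Assemble a GPT3Mix-style prompt: k demonstrations sampled from
    the few-shot training set, followed by an empty slot for the
    model to complete with a new labeled example."""
    rng = random.Random(seed)
    demos = rng.sample(train_set, k)          # k demonstrations
    lines = ["Each item in the following list contains a movie review "
             "and its sentiment (positive or negative)."]
    for text, label in demos:
        lines.append(f"Review: {text} (Sentiment: {label})")
    lines.append("Review:")                   # the model continues here
    return "\n".join(lines)

train = [("text%d" % i, "positive") for i in range(8)]
prompt = build_gpt3mix_prompt(train, k=4, seed=0)
```

Each call with a fresh random sample of demonstrations yields a different prompt, so repeated calls produce diverse augmentations from the same small training set.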

Study of weighting network instantiation.

Table 6 to evaluate the resulting generators.

Evaluation of generator training objectives. We use two metrics: Generated data accuracy (Acc; higher is better) and generator's perplexity on the test set (PPL; lower is better). The results are averaged over 5 D train /D dev splits.

(For MNLI) Examples of generated second sequences (hypotheses) by generators tuned with three different objectives, conditioned on a given first sequence (premise): "In 2009, hair samples from 1,137 Taiji residents were tested for mercury by the National Institute for Minamata Disease (NIMD)". The true label of each generated sequence is marked at the end (blue if the generation correctly pertains to the target label; red otherwise).

16-shot training samples of SST-2.

Examples of concrete generations for CoLA.

Examples of concrete generations for QQP. The first question (italicized) is randomly sampled from the pretraining corpus; the second question (underlined) is generated by G θp.

#1 equivalent:
Question 1: How long does it take for a project to be completed?
Question 2: How long does it take to complete a project?
#1 not equivalent:
Question 1: How long does it take for a project to be completed?
Question 2: What is the total cost of completing this project?
#2 equivalent:
Question 1: What mascots or characters would you like to see included as Super Mario Maker DLC in the future?
Question 2: What would you like to see in Super Mario Maker DLC that you did not see in the game?
#2 not equivalent:
Question 1: What mascots or characters would you like to see included as Super Mario Maker DLC in the future?
Question 2: How do I get a copy of this game?

Examples of concrete generations for MNLI. The first sentence (italicized) is randomly sampled from the pretraining corpus; the second sentence (underlined) is generated by G θp.

#1 entailment:
Sentence 1: Air is provided for the combustion by an electric blower.
Sentence 2: The blower provides air to a combustion chamber.
#1 neutral:
Sentence 1: Air is provided for the combustion by an electric blower.
Sentence 2: Electric blowers are available in most gas stations.
#1 contradiction:
Sentence 1: Air is provided for the combustion by an electric blower.
Sentence 2: The blower does not work.
#2 entailment:
Sentence 1: Since its base is almost at sea level, it is only the 15th highest light in the United States, the first 14 being built on higher ground.
Sentence 2: It is the 15th highest light in the United States.
#2 neutral:
Sentence 1: Since its base is almost at sea level, it is only the 15th highest light in the United States, the first 14 being built on higher ground.
Sentence 2: The lighthouse was originally constructed to be a beacon for ships passing by and as such has been used since before World War II.
#2 contradiction:
Sentence 1: Since its base is almost at sea level, it is only the 15th highest light in the United States, the first 14 being built on higher ground.
Sentence 2: It is located on a mountain top.

Examples of concrete generations for QNLI. The question (italicized) is randomly sampled from the pretraining corpus; the answer (underlined) is generated by G θp.

#1 entailment:
Question: What makes you want to step up to the next level?
Answer: I want to be the best player I can be.
#1 not entailment:
Question: What makes you want to step up to the next level?
Answer: The new program will be called "Project 10" and it is expected that a total of $450 million in federal funding would go toward it.
#2 entailment:
Question: How do all those shops know what you would like to buy?
Answer: The stores are able to track your preferences and provide you with a list of products that are best for you.
#2 not entailment:
Question: How do all those shops know what you would like to buy?
Answer: The stores are not required by law or regulation in the United States and Canada but they have been known for years as a source of illegal sales on eBay.

Examples of concrete generations for RTE. The first sentence (italicized) is randomly sampled from the pretraining corpus; the second sentence (underlined) is generated by G θp .

Examples of concrete generations for MRPC. The first sentence (italicized) is randomly sampled from the pretraining corpus; the second sentence (underlined) is generated by G θp .

