SELF-CONSISTENT LEARNING: COOPERATION BETWEEN GENERATORS AND DISCRIMINATORS

Abstract

Using generated data to improve the performance of downstream discriminative models has recently gained popularity due to the great development of pre-trained language models. In most previous studies, generative models and discriminative models are trained separately and thus could not adapt to any changes in each other. As a result, the generated samples can easily deviate from the real data distribution, while the improvement of the discriminative model quickly reaches saturation. Generative adversarial networks (GANs) train generative models via an adversarial process with discriminative models to achieve joint training. However, the training of standard GANs is notoriously unstable and often falls short of convergence. In this paper, to address these issues, we propose a self-consistent learning framework, in which a discriminator and a generator are cooperatively trained in a closed-loop form. The discriminator and the generator enhance each other during multiple rounds of alternating training until a scoring consensus is reached. This framework proves to be easy to train and free from instabilities such as mode collapse and non-convergence. Extensive experiments on sentence semantic matching demonstrate the effectiveness of the proposed framework: the discriminator achieves 10+ AP of improvement on the zero-shot setting and new state-of-the-art performance on the full-data setting.

1. INTRODUCTION

The advance of Pre-trained Language Models (PLMs) (Brown et al., 2020; Chowdhery et al., 2022) has substantially improved the performance of deep neural networks across a variety of Natural Language Processing (NLP) tasks. Various language models based on the Transformer (Vaswani et al., 2017) architecture have been proposed, leading to state-of-the-art (SOTA) performance on the fundamental discrimination tasks. These models are first trained with self-supervised training objectives (e.g., predicting masked tokens according to surrounding tokens) on massive unlabeled text data, then fine-tuned on annotated data to adapt to downstream tasks of interest. However, annotated data is usually limited for a wide range of downstream tasks, which results in overfitting and a lack of generalization to unseen data.

One straightforward way to deal with this data scarcity problem is data augmentation (Xie et al., 2020), and incorporating generative models to perform data augmentation has been widely adopted recently (Carlini et al., 2021; Gangal et al., 2022). Despite its popularity, the generated text can easily deviate from the real data distribution when none of the signals passed back from the discrimination task are exploited. In previous studies, generative data augmentation and discrimination have been well studied as separate problems, but it is less clear how these two can be leveraged in one framework and how their performances can be improved simultaneously.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Gulrajani et al., 2017) are good attempts to couple generative and discriminative models in an adversarial manner, where a two-player minimax game between learners is carefully crafted. GANs have achieved tremendous success in domains such as image generation (Denton et al., 2015), and related studies have also shown their effectiveness in semi-supervised learning (Salimans et al., 2016; Kumar et al., 2017).
However, GANs are notoriously difficult to train: most training objectives work well for only one model, either the discriminator or the generator, so the two learners can rarely be optimal at the same time (Arjovsky & Bottou, 2017; Wiatrak et al., 2019). This essentially arises from the adversarial nature of GANs: during training, optimizing one learner can easily destroy the learning ability of the other, making GANs fail to converge. Another limitation of simultaneously optimizing the generator and the discriminator comes from the discrete nature of text in NLP, as no gradient can be propagated from the discriminator to the generator. One theoretically sound attempt is to use reinforcement learning (RL), but the sparsity and the high variance of the rewards in NLP make the training particularly unstable (Caccia et al., 2020).

To address these shortcomings, we introduce a self-consistent learning framework based on one generator and one discriminator: the generator and the discriminator are alternately trained by way of cooperation instead of competition, and the samples are used as the medium to pass the feedback signal from the discriminator. Specifically, in each round of training, the samples generated by the generator are synthetically labeled by the discriminator, and then only a part of them is selected based on dynamic thresholds and used for the training of the discriminator and the generator in the next round.

Several benefits follow from this cooperative training process. First, a closed-loop form of cooperation can be established so that we can get the optimal generator and discriminator at the same time. Second, this framework helps improve the generation quality while ensuring the domain specificity of the generator, which in turn contributes to training. Third, a steady stream of diverse synthetic samples can be added to the training in each round and lead to continuous improvement of the performance of all learners.
Finally, we can start the training with only a domain-related corpus and obtain strong results, and such data can be easily sampled with little cost or supervision. Also, the performance on labeled datasets can be further boosted on top of the SOTA level. As an example to demonstrate the effectiveness of our framework, we examine it on the task of sentence semantic matching. The experiments show that our method significantly improves over standalone state-of-the-art discriminative models on zero-shot and full-data settings.

Our contributions are summarized as follows:

• We propose a self-consistent learning framework that incorporates the generator and the discriminator, in which both achieve remarkable performance gains simultaneously.
• We propose a dynamic selection mechanism such that cooperation between the generator and the discriminator drives the convergence to reach their scoring consensus.
• Experimental results show that our proposed framework significantly outperforms the state-of-the-art methods for the task of sentence semantic matching.

2. RELATED WORKS

To alleviate the lack of annotated data in supervised learning in NLP, semi-supervised learning (SSL) has been a popular line of research (Van Engelen & Hoos, 2020). The unlabeled data required by SSL are either collected from the target domains or generated by generative language models. NLU models can then learn from the unlabeled data by pseudo-labeling (Arazo et al., 2020; Banitalebi-Dehkordi & Zhang, 2021) and consistency regularization (Jeong et al., 2019; Sohn et al., 2020). However, collecting unlabeled data comes at a cost (though smaller than labeling data), and the total amount is limited. Even with generative models, there is no guarantee of the quality of the generated samples, because the model cannot tune its generations based on the performance of the downstream tasks. In contrast, our method includes a continuously updated generative model, which dynamically adjusts its generation according to the performance of downstream tasks.

In GANs, the generator is adversarially trained with the discriminator. Unlike conventional GANs in continuous domains, language GANs usually employ Gumbel-Softmax differentiation (Jang et al., 2017; Yin et al., 2020), Reinforcement Learning (RL) (Yu et al., 2017; Wu et al., 2021), or modified training objectives (Montahaei et al., 2021) to update the generator, in order to use the non-differentiable signals from the discriminator. However, language GANs are often criticized for underperforming maximum likelihood estimation (MLE) and are very difficult to train; even the individual optimality of either the generator or the discriminator cannot be guaranteed (Alvarez-Melis et al., 2022). In comparison, our proposed framework allows us to cooperatively couple the generator and the discriminator, leading to continuous improvement for both learners.

3.1. COOPERATIVE OR ADVERSARIAL

Following the principle of self-consistency outlined in Ma et al. (2022), a closed-loop training needs to be built between the generator and the discriminator, either cooperatively or adversarially. GANs are typical examples of adversarial learning, but training GANs remains quite unstable. Let us consider an extreme case to show the possible instability: the discriminator can perfectly distinguish real data and fake data generated by the generator, and the generator can fully reproduce the real data distribution. Then the discriminator has only a 50% probability of selecting all samples that are generated by the generator. Therefore, any further updates to the generator parameters based on the feedback from the discriminator move the generator away from the optimum. Neither the generator nor the discriminator is then likely to be optimal (Arjovsky & Bottou, 2017; Lamprier et al., 2022). In practice, a very delicate balance needs to be maintained between the discriminator and the generator to keep the training stable. Cooperative closed-loop learning, as discussed below, does not suffer from this instability: the generator and the discriminator usually enhance each other.

3.2. SELF-CONSISTENT LEARNING FRAMEWORK

In this section, we introduce our self-consistent learning (SCL) framework. As shown in Figure 1, our framework, similar to GANs, consists of a generator and a discriminator model. However, in contrast to GANs, these two parts work cooperatively to enhance each other. Specifically, for any given class $k$, the generator $G$ becomes a conditional generator that takes an input sentence $s^a_k$ and generates an output sentence $s^b_k$. The discriminator $D$ is then responsible for discriminating the sentence using a dynamic threshold $\epsilon_D$. The discriminated sentence is used as positive or negative data for that specific class to continue the training process. Once the new discriminator is trained, the sentence is discriminated again by the new discriminator with a different dynamic threshold $\epsilon_G$. This time only the positive data is passed to the generator as the training data for the new round. In this way, a closed loop of cooperation is formed.

In the above closed-loop training, we propose a selection mechanism that uses dynamic thresholds to filter samples. This mechanism is empirically shown to play a critical role in closing the gap between the generator and the discriminator, and thus makes this cooperation loop a virtuous circle. Specifically, as shown in Equation 1, the output probability $p_D(y = k \mid s^b_k)$ that the sentence $s^b_k$ belongs to class $k$ is calculated from the embedding representation $H$ (footnote 1) of $s^b_k$:

$$p_D(y = k \mid s^b_k) = \mathrm{softmax}(\mathrm{MLP}(H)) \quad (1)$$

Theorem 1 shows that the generator at round $t$ is encouraged to approximate the probability distribution given by the previous round discriminator. The proof is given in Appendix A. In particular, on the basis of a well-pretrained discriminator, the generated distribution of the generator can be guaranteed to be faithful to the real data distribution.

Why Cooperative, Not Adversarial?
(1) the generator is no longer a challenger to the discriminator that only provides negative data points to fool it, but now serves as a data augmenter to provide both positive and negative data points to enhance the discriminator; (2) the generator no longer updates its parameters through the policy gradients guided by the signals from the discriminator, but rather by utilizing the filtered data points to further improve its conditional generation quality. Note that by deliberately choosing the conditional generation paradigm along with the selection mechanism, we not only make the training more stable due to the different training goals, but also bypass the mode collapse problem of GANs (see Section 4.6 for further discussion). Besides, by iterating through the loops, our framework achieves self-consistency by honing the domain specificity of the generator and increasing the domain data exposure of the discriminator.

3.3. SENTENCE SEMANTIC MATCHING

We leverage the sentence semantic matching task (i.e., $k = 2$) as an example to demonstrate the effectiveness of our method. In this case, corresponding to Equation 2, $k = 1/0$ represents the positive/negative class, and each class has its own filter function in round $t$.

Training D: We first randomly sample $s^a_t$ from the domain-related corpus $C$, and then input $s^a_t$ to $G_t$ to generate $s^b_t$. Next, we feed the sentence pair $\{s^a_t, s^b_t\}$ into $D_{t-1}$ to predict the label $y_{t-1}$, and filter $\{s^a_t, s^b_t, y_{t-1}\}$ using the threshold $\epsilon^{t-1}_D$. Finally, we train $D_{t-1}$ on the selected data and the pre-training data $P$ to get an improved discriminator $D_t$. Note that the filtered data contain both positive and negative samples. The update process of $D$ seeks to minimize the cross-entropy loss over all instances:

$$L_D(s, y) = \frac{1}{|s|} \sum_{i=1}^{|s|} -\left[ y_i \cdot \log p_D(y_i = 1 \mid s^a_i, s^b_i) + (1 - y_i) \cdot \log\left(1 - p_D(y_i = 1 \mid s^a_i, s^b_i)\right) \right] \quad (4)$$

Training G: We feed the generated sentence pairs $\{s^a_t, s^b_t\}$ into $D_t$ to predict new labels $y_t$, and then filter $\{s^a_t, s^b_t, y_t\}$ using the threshold $\epsilon^t_G$ and additional rules (footnote 2). Note that the filtered data contain only positive samples. We supplement the filtered data with the pre-training data $P$ to update $G_t$ to $G_{t+1}$ (footnote 3). We also take out $s^b_t$ from the filtered data and add them to the domain-related corpus. The expanded domain corpus is used to sample conditional sentences in the next round of generation. The update procedure of $G$ employs the negative log-likelihood function over all instances:

$$L_G(s^a, s^b) = -\frac{1}{|s^b|} \sum_{t=1}^{|s^b|} \log p_G(s^b_t \mid s^b_{<t}, s^a)$$

For the selection mechanism, we adopt the form $\epsilon_t = m \cdot t + \lambda$ after comparing the effects of different functions through experiments according to Equation 3, where $m$ is the increment of the threshold for each round, $\lambda$ is the initial threshold, and $\epsilon_t$ is the threshold for round $t$.
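The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the comparison direction (keep a pair when the confidence is at least the threshold), the use of `1 - p` as the negative-class confidence, the helper names, and the rule parameters `min_len`, `max_len`, and `max_overlap` are all hypothetical. Only the linear form of the threshold and the longest-common-substring rule come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScoredPair:
    s_a: str      # conditioning sentence sampled from the corpus
    s_b: str      # sentence produced by the generator
    p_pos: float  # discriminator probability that the pair is similar


def linear_threshold(t: int, m: float, lam: float) -> float:
    """Threshold for round t, epsilon_t = m * t + lambda (form given in the text)."""
    return m * t + lam


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def passes_rules(s_a: str, s_b: str, min_len: int = 3, max_len: int = 64,
                 max_overlap: float = 0.9) -> bool:
    """Hypothetical rule filter: drop outputs that are too short, too long, or near copies."""
    if not (min_len <= len(s_b) <= max_len):
        return False
    overlap = longest_common_substring(s_a, s_b) / max(len(s_a), len(s_b))
    return overlap <= max_overlap


def select_for_discriminator(pairs: List[ScoredPair], eps_pos: float,
                             eps_neg: float) -> List[Tuple[str, str, int]]:
    """Keep confident positives (label 1) and confident negatives (label 0)."""
    selected = []
    for x in pairs:
        if x.p_pos >= eps_pos:
            selected.append((x.s_a, x.s_b, 1))
        elif 1.0 - x.p_pos >= eps_neg:  # assumed negative-class confidence
            selected.append((x.s_a, x.s_b, 0))
    return selected


def select_for_generator(pairs: List[ScoredPair], eps_g: float) -> List[Tuple[str, str]]:
    """Keep only confident positives that also pass the additional rules."""
    return [(x.s_a, x.s_b) for x in pairs
            if x.p_pos >= eps_g and passes_rules(x.s_a, x.s_b)]
```

Because the threshold grows with the round index, fewer but more confident pairs survive in later rounds under this sketch.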
In the process of training $G$, since the sentences generated in each round are added to the domain-related corpus, the source of domain-specific data expands monotonically as the self-consistent learning loop iterates. The formalized process is shown in Algorithm 1.

Zero-Shot Baseline: We utilize the best-performing Chinese model RoBERTa-wwm-ext-large (Cui et al., 2020; 2021) and English model ALBERT-xxlarge-v2 (Lan et al., 2020) as the base discriminators in our self-consistent learning framework.

Fine-Tune Baseline: We compare our model with several state-of-the-art semantic matching models: the Chinese models MacBERT (Cui et al., 2020), StructBERT (Wang et al., 2020), RoFormer (Su et al., 2021; Su et al., 2022), XLNet (Xu et al., 2020), ELECTRA (Cui et al., 2020), ALBERT (Lan et al., 2020), and RoBERTa (Cui et al., 2020; 2021), and the English models BERT (Devlin et al., 2019b), XLM-RoBERTa (XLM-R) (Conneau et al., 2020), XLNet (Yang et al., 2019b), ELECTRA (Clark et al., 2020), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2019). For a fair comparison, we use models that are as close in size as possible.

4.2.1. DATASETS

We conduct experiments on three Chinese semantic matching datasets, AFQMC (Xu et al., 2020), CHIP-STS (Zhang et al., 2022a), and Chinese-QQP (Wang et al., 2019), and one English semantic matching dataset, MRPC (Wang et al., 2019). More details about the datasets are given in Appendix E.

4.2.2. MODEL PRE-TRAINING

We adopt the well-established Transformer-XL (Dai et al., 2019) /OPT (Zhang et al., 2022b) architectures as the Chinese/English generator. To enable the generator to generate similar sentences with better linguistic quality, we pre-train a Transformer-XL model with 5.0 billion parameters and incrementally pre-train an OPT model with 2.7 billion parameters on the corpus consisting of plain texts and similar sentence pairs. Cleaned large-scale Chinese corpus WuDaoCorpora (Yuan et al., 2021) and English corpus WikiText (Merity et al., 2017) are used as plain texts. Similar sentence pairs that do not overlap with downstream datasets are used in the pre-training, and the designed prompts are employed to guide the generation of similar sentences. More details regarding model pre-training can be found in Appendix D.

4.3. ZERO-SHOT RESULTS

Table 1 shows how the F1 score of the discriminator varies with the number of self-consistent learning rounds on different datasets in the zero-shot task. According to Algorithm 1, the training is stopped when the discriminator no longer improves for two consecutive rounds. In addition, these four datasets are collected from different domains to further reflect the generality of our method. Specific training settings are recorded in Appendix F. The scores in the last line of Table 1 give the improvement of our discriminator in the last round relative to the first round. We can see that the F1 score gradually increases after each training round, eventually reaching a 10+ absolute percentage (AP) improvement. We believe what drives the improvement of the discriminator is the self-consistency, which it acquires with the generator step by step during the loop.

To verify that the generator also improves after self-consistent training, we adopt perplexity and BERTScore (Zhang et al., 2020) to measure the language fluency and the semantic similarity (i.e., domain specificity), respectively. For the generators in different rounds, we first select $s^a$ from the similar sentence pairs in the same test set as the original input sentences, and generate similar sentences $s^b$ with greedy search. We do not use other sampling methods in order to ensure reproducibility. Given the generated sentences, we introduce an additional GPT2 (footnote 4) model to calculate the perplexity of the generated similar sentences, and use a third-party library (footnote 5) to calculate the BERTScore between the original and generated similar sentences. The results are shown in Table 2. Compared to the first round, the perplexity of the last round in Table 2 has decreased and the BERTScore has improved. Note that a lower perplexity indicates a more fluent sentence, while a higher BERTScore indicates a more similar sentence.
It suggests that after self-consistent training, the generator is gradually improved in language fluency and semantic similarity (i.e. domain specificity). The reason why the improvement of the generator is not as obvious as that of the discriminator is that the size of the generator is several times that of the discriminator, and the total number of training samples is limited. In Appendix G, the generated samples of the generator in different rounds are given to show the changes in the generation.
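For reference, the perplexity used above as the fluency measure can be computed from per-token log-probabilities. The paper does not spell out the formula, so the sketch below assumes the standard definition (the exponential of the mean negative log-likelihood per token); the function names are hypothetical, and in the paper's setup the log-probabilities would come from the separate GPT2 scoring model.

```python
import math
from typing import Sequence


def mean_nll(token_log_probs: Sequence[float]) -> float:
    """Average negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Assumed standard definition: exp of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_log_probs))
```

Under this definition, a sentence whose tokens each receive probability 0.5 from the scoring model has a perplexity of 2, and more predictable sentences score lower.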

4.4. FINE-TUNE RESULTS

Our method not only works well in the zero-shot case, but also achieves good results in the full-data case. For the sake of a fair comparison, we reproduce several state-of-the-art semantic matching models on the four training sets, and their performances on the test sets are shown in Table 3. Our approach uses the best-performing model on a single test set as the base discriminator for self-consistent learning. The bold scores in the last line of Table 3 show that our method outperforms the SOTA results (shaded in gray) by 1 to 2 AP on all four test datasets, indicating the potential of self-consistent learning to further improve the model performance and establish new SOTA.

4.5. EVALUATING SELF-CONSISTENCY

In this section, we evaluate the consistency between the generator and the discriminator as the learning loop unfolds. We follow the same method used in Section 4.3 and use greedy search to generate similar sentences on the same test set. Then we take the confidence of the discriminator $R_D$ as the score of the discriminator, which is calculated for the original sentences $s^a$ and the generated similar sentences $s^b$ according to Equation 5:

$$R_D = p_D(y^+ \mid s^a, s^b) \quad (5)$$

where $y^+$ represents a positive label. However, for the generator, to the best of our knowledge, there is no reliable way to measure how similar $s^a$ and $s^b$ are by using the generator itself. Therefore, to quantify this similarity, we introduce a third-party static model SimCSE (footnote 6) to get the embedding representations $A$, $B$ of the sentences $s^a$, $s^b$. The cosine similarity $R_G$ between $A$ and $B$ is then calculated according to Equation 6 to approximate the score of the generator:

$$A, B = \mathrm{Encoder}(s^a), \mathrm{Encoder}(s^b), \qquad R_G = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2} \quad (6)$$

where $A$ and $B$ both represent the embedding representation at the [CLS] position. Note that the original sentence $s^a$ remains unchanged in each round, while the generated sentence $s^b$ changes.

Finally, for the trained discriminator and generator in each round $t$, we can obtain two score distributions $R^t_D$ and $R^t_G$ correspondingly. According to Theorem 1, we draw the curves of KL divergence between $R^t_D$ and $R^t_G$ in each round for the four datasets: AFQMC, CHIP-STS, Chinese-QQP, and MRPC. As illustrated in Figure 2, all the curves show a clear downward trend, indicating that the distance between the two score distributions decreases with the increase in the number of training rounds until a score consensus is reached. Table 4 shows the values of KL divergence in the first and last rounds. Numerically, it is more evident that the distances are significantly reduced on the four datasets.
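The consistency measurement above can be illustrated with a small Python sketch. The text does not say how the two score distributions are estimated or in which direction the KL divergence is taken, so the histogram binning, the smoothing constant, and the direction KL(R_D || R_G) used here are assumptions, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """R_G = (A . B) / (||A||_2 * ||B||_2), as in Equation 6."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_histogram(scores: Sequence[float], bins: int = 10,
                    smooth: float = 1e-6) -> List[float]:
    """Assumed estimator: smoothed, normalized histogram of scores over [0, 1]."""
    counts = [smooth] * bins
    for s in scores:
        idx = int(s * bins)
        idx = max(0, min(idx, bins - 1))  # clip scores outside [0, 1) into the edge bins
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A usage pattern would be to collect the discriminator scores and the SimCSE-based scores for the same test pairs in each round, build one histogram per model, and track `kl_divergence` across rounds.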

4.6. EFFECT OF PRE-TRAINING DATA AND SELECTION MECHANISM

We perform ablation experiments on the pre-training data and the selection mechanism in the zero-shot case. As described in Section 4.1, the pre-training data is used to pre-train the generator and discriminator, completely independent of the experimental datasets in self-consistent training. To explore the influence of pre-training data on self-consistent training, we no longer add it in each round when training the discriminator, and only the generated data is used. But when the generator is trained, pre-training data is still retained to prevent language degeneration and lack of expressive diversity of the generation. The result of removing pre-training data is shown as the green curves in Figure 3. With all other training parameters being the same, after the same number of training rounds, the discriminator is slightly worse compared to the original method (red curves in Figure 3). However, the green curves maintain an upward trend and are very close to the red curves in all datasets except CHIP-STS. This shows that the generated data plays a key role in continuously improving the discriminator, while the pre-training data has a limited role.

Table 4: The KL divergence in the first and last rounds.

For future work, we will explore the effectiveness of our self-consistent learning framework on more NLP tasks, since the framework is straightforward and has no additional requirements on generators and discriminators.

6. REPRODUCIBILITY STATEMENT

We now discuss the efforts that have been made to ensure the reproducibility of our work. We have packaged the executable code and data into supplementary materials, which can be downloaded and run directly. In addition, we also provide detailed experimental parameters in the appendix to reproduce the experimental results. All datasets and code platforms (PyTorch) we use are public.

To compare the effect of different filter functions on the final result, we use four types of functions: an oscillatory function (cosine), a constant function, and monotonically increasing functions (quadratic and linear). For fairness of comparison, we keep the maxima and minima the same for all functions (except for the constant threshold), and the values are given in Appendix F. In addition, the number of training rounds for different functions on the same dataset remains the same. In the results below, the best results and the second-best results are bold and underlined, respectively.

As can be seen from Table 5, in the zero-shot setting, the chosen linear function outperforms the other functions, and all the filter functions show an average improvement of 10+ AP relative to the baseline. Therefore, the self-consistent learning framework makes it easy to choose a threshold function that performs well, and the results are not very sensitive to the choice of the function. In general, we can intuitively see that all functions show a significant increase relative to the starting point.

Table 6 shows the effects of different filter functions in the fine-tune experiment. It can be seen that all functions have a 1 to 2 AP increase relative to the baseline, and the chosen linear function achieves the best performance on all datasets except Chinese-QQP.
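The four families of filter functions compared here can be written as simple threshold schedules. The exact functional forms are not given in this section, so the parameterizations below are assumptions chosen only so that the cosine, quadratic, and linear schedules share the same minimum `lo` and maximum `hi`; the oscillation `period` and the constant `value` are hypothetical parameters.

```python
import math


def linear(t: int, rounds: int, lo: float, hi: float) -> float:
    """Rises from lo to hi in equal steps; equals m * t + lambda with m = (hi - lo) / rounds."""
    return lo + (hi - lo) * t / rounds


def quadratic(t: int, rounds: int, lo: float, hi: float) -> float:
    """Monotonically increasing, slower than linear at first and faster later."""
    return lo + (hi - lo) * (t / rounds) ** 2


def cosine(t: int, rounds: int, lo: float, hi: float, period: int = 4) -> float:
    """Oscillates between lo and hi with an assumed period (in rounds)."""
    return lo + (hi - lo) * 0.5 * (1.0 - math.cos(2.0 * math.pi * t / period))


def constant(t: int, value: float) -> float:
    """Same threshold in every round."""
    return value
```

For example, with `lo = 0.7`, `hi = 0.9`, and four rounds, the linear schedule adds 0.05 per round, while the quadratic schedule is still at 0.75 after two rounds.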

C CONTRASTIVE EXPERIMENTS WITH ADVERSARIAL TRAINING

In this section, we further demonstrate the superiority of the cooperative approach by comparing it with adversarial training. All experimental settings independent of the training method remain the same in the adversarial training. During the experiments, the generator is no longer trained using the samples filtered by the discriminator; instead, the rewards passed by the discriminator assist the training. All generated samples are treated as negative samples when training the discriminator.

Specifically, $G$ takes the prompt '"$s^a$" is similar to "' and the first $M$ tokens of $s^b$ as input to get $M$ sentence pairs $\langle s^a, s^b_m \rangle$, where $m$ ranges from 1 to $M$. Note that we repeat the process of generating sentences $N$ times to reduce the negative impact caused by the large variance of the rewards (footnote 7). The sentence pair $g^n_m$ of the $m$-th token at the $n$-th time is formalized as

$$g^n_m \leftarrow \langle s^a, s^b_m \rangle = G_\theta(s^b_m \mid s^b_{<m}, s^a; N)$$

Once the $M \times N$ sentence pairs $g^n_m$ are generated, they are passed as input to $D$ to obtain the probability score $Q^n_m$ for each of them. We take the average of $Q^n_m$ over $N$ as the reward $\bar{Q}_m$ corresponding to the $m$-th token. If the sentence length of $s^b$ is greater than $M$, the rewards of the remaining tokens are all the same as that of the $M$-th token. Taking the $m$-th token as an example, the reward $\bar{Q}_m$ can be formalized as

$$\bar{Q}^{G_\theta}_{D_\phi}(m) = \begin{cases} \dfrac{1}{N} \sum_{n=1}^{N} D_\phi(g^n_m), & m \le M \\ \bar{Q}(M), & m > M \end{cases}$$

Therefore, the objective function for training the generator $G$ is

$$L_G(s^a, s^b) = -\frac{1}{|s^b|} \sum_{t=1}^{|s^b|} \log\left( p_G(s^b_t \mid s^b_{<t}, s^a) \cdot \bar{Q}_t \right)$$

The loss function for training the discriminator remains the same as Equation 4, but unlike in cooperative training, the generated samples are regarded as negative samples for the discriminator, and the training target for the discriminator is given by

$$\min_\phi \; -\mathbb{E}_{X \sim p_{\mathrm{data}}}\left[\log D_\phi(X)\right] - \mathbb{E}_{X \sim p_{G_\theta}}\left[\log\left(1 - D_\phi(X)\right)\right]$$

The results of zero-shot and fine-tune on the four datasets are shown in Tables 7 and 8.
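The reward construction for the adversarial baseline can be sketched as follows. This is an illustrative reading of the formulas above, not the authors' code: `scores[n][m]` is assumed to hold the discriminator score of the m-th prefix pair in the n-th repetition, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def token_rewards(scores: Sequence[Sequence[float]], length: int) -> List[float]:
    """Average the N x M discriminator scores over the N repetitions, then
    reuse the reward of the M-th token for every position beyond M."""
    n_rep = len(scores)
    m_prefix = len(scores[0])
    averaged = [sum(row[m] for row in scores) / n_rep for m in range(m_prefix)]
    return [averaged[min(t, m_prefix - 1)] for t in range(length)]


def generator_loss(token_probs: Sequence[float], rewards: Sequence[float]) -> float:
    """Mean of -log(p_G(token) * reward) over the tokens of s_b, as in the objective above."""
    total = sum(math.log(p * q) for p, q in zip(token_probs, rewards))
    return -total / len(token_probs)
```

With `M = 2` scored prefixes and a four-token sentence, the last three positions all share the reward of the second prefix, which is the padding rule stated in the text.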
As can be seen from Table 7, in the zero-shot setting, training in an adversarial manner does not give any improvement over the baseline. Because the initial discriminator in the zero-shot setting is very weak in distinguishing positive and negative samples, it is reasonable to believe that if all generated samples are considered negative samples from the very beginning, it is difficult for the discriminator to learn how to recognize positive samples. As a result, the F1 scores on both the AFQMC and CHIP-STS datasets end up being 0, while the scores on the Chinese-QQP and MRPC datasets fluctuate strongly with the number of rounds, which further validates the instability of adversarial training in the zero-shot setting.

For the fine-tune experiments, Table 8 shows that training in an adversarial manner can slightly improve the performance on the Chinese-QQP and MRPC datasets, but is still worse than the cooperative training. On the AFQMC and CHIP-STS datasets, adversarial training performs even worse than the baseline. It is worth noting that the whole process of adversarial training is unstable and can easily collapse after a few training rounds. The training parameters of fine-tuning are shown in Table 11.

G GENERATE SAMPLES

The generators use nucleus sampling (Holtzman et al., 2020) to generate similar sentences. Generated examples in English are shown in Table 12 and in Chinese in Table 13 .



1. We follow Reimers & Gurevych (2019) and use the embedding representation of the CLS token as the sentence representation $H$.
2. The additional rules are used to exclude sentences which are too long, too short, or too similar according to the longest common substring algorithm.
3. Note that the pre-training data $P$ is used to warm up $G$ and $D$. Although pre-training data is not mandatory in subsequent training, we empirically found that including it when training $G$ can prevent language degeneration and improve downstream performance.
4. Wenzhong-GPT2-110M (Wang et al., 2022) for Chinese data, and GPT2-base (Radford et al., 2019) for English data.
5. https://pypi.org/project/bert-score/
6. We use SimCSE-BERT-base to calculate scores on Chinese datasets and sup-SimCSE-BERT-base-uncased on English datasets (Gao et al., 2021).
7. In practice, we take M = 5, N = 5 for ease of calculation.

To compare the effect of different filter functions on the final result, we use four types of functions, including oscillatory function (cosine), constant function and monotonically increasing functions



Figure 1: Overview of the flow chart for the SCL framework.

represents the filter function in round t for the positive/negative class respectively. First, let us introduce the formal definition of this task. Given two sentences $s^a = \{w^a_1, w^a_2, \ldots, w^a_{\ell_a}\}$ and $s^b = \{w^b_1, w^b_2, \ldots, w^b_{\ell_b}\}$, where $w^a_i$ and $w^b_j$ represent the $i$-th and $j$-th tokens in the sentences and $\ell_a$ and $\ell_b$ indicate the lengths of $s^a$ and $s^b$, the goal of this task is to learn a discriminator $D$ that precisely predicts the label $y = D(s^a, s^b)$, where $y \in Y = \{0, 1\}$ indicates whether the two sentences are similar.

In our task, $G$ is trained to generate a similar sentence $s^b$ from any given sentence $s^a$, and $D$ is trained to predict the label $y$ from any given sentence pair $\{s^a, s^b\}$. As demonstrated in Figure 1, there are mainly two training processes in the entire framework: fix $G$ to train $D$, and fix $D$ to train $G$. We introduce the two training procedures in detail with the $t$-th round of training.

Algorithm 1: Self-consistent Learning (SCL)
Require: Generator $G$; Discriminator $D$; Domain-Related Corpus $C$; Pre-training Data $P$.
1: Initialize $G_0$ and $D_0$ with pre-trained language models;
2: Warm up $G_0$ and $D_0$ with pre-training data $P$ to get $G_1$ and $D_1$;
3: for each round $i \in [1, n]$ do
4:     if the discriminator still improves within two consecutive rounds then
5:         Generate similar sentences $s^b \sim p_{G_i}(\cdot \mid s^a)$ from sentences $s^a$ sampled from $C$;
6:         Predict pseudo-labels $y_i \sim p_{D_i}(\cdot \mid s^a, s^b)$;
7:         Use threshold $\epsilon^i_D$ to select data on $\{s^a, s^b, y_i\}$ to train $D_{i+1}$;
8:         Predict pseudo-labels $y_{i+1} \sim p_{D_{i+1}}(\cdot \mid s^a, s^b)$;
9:         Use threshold $\epsilon^i_G$ and additional rules to select data on $\{s^a, s^b, y_{i+1}\}$ to train $G_{i+1}$;

The pre-training datasets are used to warm up the discriminator and generator, and the domain-related corpus is a set of independent sentences. To avoid label leakage, none of the training datasets participate in the pre-training of the generator and discriminator. In other words, the datasets in pre-training and self-consistent training are two non-overlapping datasets.
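The loop in Algorithm 1 can be sketched in Python with the models abstracted as plain callables. This is a minimal sketch under several assumptions rather than the authors' implementation: `generate`, `score`, `train_d`, `train_g`, and `evaluate` are hypothetical stand-ins for the real generator and discriminator, a single discriminator threshold is used for both classes, mixing in the pre-training data P is left to the training callables, and stopping uses a patience of two rounds without improvement, following the stopping rule stated in Section 4.3.

```python
from typing import Callable, List, Tuple


def self_consistent_loop(
    corpus: List[str],
    generate: Callable[[str], str],          # G: s_a -> s_b
    score: Callable[[str, str], float],      # D: probability that the pair is similar
    train_d: Callable[[List[Tuple[str, str, int]]], None],
    train_g: Callable[[List[Tuple[str, str]]], None],
    evaluate: Callable[[], float],           # e.g. dev-set F1 of the current D
    eps_d: Callable[[int], float],           # threshold schedule for training D
    eps_g: Callable[[int], float],           # threshold schedule for training G
    max_rounds: int,
    patience: int = 2,
) -> Tuple[float, int, List[str]]:
    corpus = list(corpus)
    best, stale, rounds_run = evaluate(), 0, 0
    for t in range(1, max_rounds + 1):
        rounds_run = t
        pairs = [(s_a, generate(s_a)) for s_a in corpus]
        # Pseudo-label with the current D and keep only confident pairs.
        d_data = []
        for s_a, s_b in pairs:
            p = score(s_a, s_b)
            if p >= eps_d(t):
                d_data.append((s_a, s_b, 1))
            elif 1.0 - p >= eps_d(t):
                d_data.append((s_a, s_b, 0))
        train_d(d_data)
        # Re-score with the updated D; only confident positives go to G.
        g_data = [(a, b) for a, b in pairs if score(a, b) >= eps_g(t)]
        train_g(g_data)
        # Grow the domain-related corpus with the accepted generations.
        for _, s_b in g_data:
            if s_b not in corpus:
                corpus.append(s_b)
        f1 = evaluate()
        if f1 > best:
            best, stale = f1, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, rounds_run, corpus
```

Passing the schedules as callables keeps the loop independent of the particular filter function, so a linear schedule or any of the alternatives can be plugged in without changing the loop.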

Figure 2: The KL Divergence between the score distributions of the Discriminator and the Generator.

Figure 3: Results of ablation experiments on pre-training data and selection mechanism. Results of the proposed method, results without pre-training data, and results without the selection mechanism are given in red, green, and blue, respectively.

Figure 4: Results of contrast experiments on the Cosine (green), Constant (orange), Quadratic (blue), and Linear (red) functions.

Figure 4 depicts the comparison results in each round. The linear function (red line) is significantly better than the other functions on both the CHIP-STS and MRPC datasets. On the AFQMC and Chinese-QQP datasets, the quadratic function (blue line) is slightly more effective than the linear function. In general, all functions show a significant increase relative to the starting point.
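To make the four filter-function shapes concrete, the sketch below shows one plausible way a selection threshold could vary with the round index. The exact functional forms are not given in this section, so the formulas, the function name `threshold`, and the `start`/`end` parameters are all assumptions chosen only to illustrate the difference between the shapes.

```python
import math


def threshold(kind: str, t: int, n: int, start: float, end: float) -> float:
    """Hypothetical threshold at round t of n, moving from `start` toward `end`."""
    frac = t / n
    if kind == "constant":
        # Assumed form: the threshold never changes.
        return start
    if kind == "linear":
        return start + (end - start) * frac
    if kind == "quadratic":
        # Changes slowly in early rounds, faster in later ones.
        return start + (end - start) * frac ** 2
    if kind == "cosine":
        # Smooth half-cosine ramp between the two endpoints.
        return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * frac))
    raise ValueError(f"unknown filter function: {kind}")
```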

Zero-Shot Performance of the Discriminator. (F1 score (%))

Zero-Shot Performance of the Generator.

F1 Score(%) of Different Discriminators on the Test Datasets.

Performance of Different Filter Functions in Zero-Shot Setting. (F1 Score(%))

Performance of Different Filter Functions in Fine-Tune Setting. (F1 Score(%))

Statistics of Experimental Datasets: AFQMC, CHIP-STS, Chinese-QQP, MRPC (en).

The training parameters for the zero-shot setting are shown in Table 10. The three thresholds are used to select positive and negative examples for training the discriminator and positive examples for training the generator, respectively. We adopt a cosine annealing learning rate decay strategy during training.
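For reference, the standard cosine annealing schedule decays the learning rate from a maximum to a minimum value along a half cosine. The sketch below uses the common textbook form; the function name and the example learning rates are hypothetical and are not taken from Table 10.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate at `step`, decaying from lr_max to lr_min over total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, for illustration only.
schedule = [cosine_annealing_lr(s, 100, lr_max=2e-5, lr_min=0.0) for s in range(0, 101, 25)]
```

The rate starts at `lr_max`, passes through the midpoint of the two values halfway through training, and reaches `lr_min` at the final step.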

Parameter Settings of Zero-Shot.

D MODEL DETAILS

The 5.0B Transformer-XL is pre-trained on 32 A100s with 40G memory for 45 days, with the batch size set to 32*8=256. After running 445k steps, the final validation loss reduces to about 2.4. The 2.7B OPT is incrementally trained on the basis of the open-source model.

During the pre-training of the generator model, we utilize the memory-cache mechanism of Transformer-XL and design a special attention mask to concatenate multiple input sentences into one sample, which reduces the number of padding tokens in a batch and therefore increases the number of effective tokens. To make the generation more robust, we add noise to the original sentences by randomly replacing or discarding tokens with a 5% probability. In addition, the prompts that we use for Chinese generation and English generation are as follows:

• Chinese prompt: "$s^a$"的相似句是"$s^b$" (en: A similar sentence to "$s^a$" is "$s^b$".)
• English prompt: "$s^a$" is similar to "$s^b$"

When training the discriminator, following the usage of special tokens in BERT (Devlin et al., 2019a), we use [SEP] to concatenate two sentences and take the embedding at the [CLS] position to represent the whole sentence pair when predicting the label. Moreover, we utilize the masking method in BERT to randomly mask 15% of the input tokens.
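The token-level noise and the prompt templates described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the replacement token is drawn uniformly from a caller-supplied vocabulary, and the even split between replacing and discarding a token is an assumption, since the text only gives the overall 5% probability.

```python
import random
from typing import List, Optional, Sequence


def add_noise(
    tokens: Sequence[str],
    vocab: Sequence[str],
    p: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Perturb each token with probability p by replacing or discarding it."""
    rng = rng or random.Random()
    noisy: List[str] = []
    for tok in tokens:
        if rng.random() < p:
            # Assumption: replace and discard are equally likely.
            if rng.random() < 0.5:
                noisy.append(rng.choice(list(vocab)))
            # Otherwise the token is discarded.
        else:
            noisy.append(tok)
    return noisy


def build_prompt(s_a: str, s_b: str, lang: str = "en") -> str:
    """Fill the generation prompt template for a sentence pair."""
    if lang == "zh":
        return f'"{s_a}"的相似句是"{s_b}"'
    return f'"{s_a}" is similar to "{s_b}"'
```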

E DATASET DETAILS

The statistics of the experimental datasets are reported in Table 9.

Other Chinese datasets (LCQMC (Liu et al., 2018), OPPO, PAWS-X-zh (Yang et al., 2019a), BQ (Chen et al., 2018), CCKS, Chinese-STS-B (Wang et al., 2019)) and English datasets (QQP (Wang et al., 2019), STS-B (Wang et al., 2019), PAWS-X-en (Yang et al., 2019a)) are collected and used as the corpus of similar sentence pairs for pre-training the generator.

The Chinese-QQP dataset contains 9000 pieces of data randomly selected and translated from the English QQP dataset, which is then divided into a training set and a test set in a ratio of 3:2.

first round: the market is flat, the US is flat, Europe is flat, Japan is flat, ... Malaysia is flat, Philippines is flat, Vietnam is flat, Singapore is
last round: Hong Kong was unchanged, Australia, Singapore and South Korea were down 0.2-0.4 percent.

origin: Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.
first round: Microsoft is a threat to Linux.
last round: Ballmer has been vocal in the past about the threat Linux poses to Microsoft.

origin: Last year, Congress passed similar, though less expensive, buyout legislation for peanut farmers, ending that Depression-era program.
first round: The government is going to buy out your farm and give you a check for $1,000,000.00. You can't refuse.
last round: Congress passed a similar bill last year to end a Depression-era program for peanut farmers, ending that program.

origin: Licensing revenue slid 21 percent, however, to $107.6 million.
first round: The company's net income fell to $1.1 million, or 2 cents per share, in the quarter ended Sept. 30, from $1.3 million, or 3 cents per share, a year earlier.
last round: Licensing revenue fell 21 percent to $107.6 million in the third quarter.

origin: SCO says the pricing terms for a license will not be announced for weeks.
first round: we're not going to tell you how much it costs to buy a car, but we're going to tell you how much it costs to rent one.
last round: The pricing terms for a license for the SCO software will not be announced for several weeks.

first round: 为什么宇宙中没有极限的存在 (Why is there no limit in the universe)
last round: 为什么没有人知道无限和有限之间的区别? (Why does no one know the difference between infinite and finite?)

