GRADUATED NON-CONVEXITY FOR ROBUST SELF-TRAINED LANGUAGE UNDERSTANDING

Anonymous

Abstract

Self-training has proven to be an effective strategy for the unsupervised finetuning of language models using unlabeled data and model-generated pseudo-labels. However, the performance of self-trained models is unstable across different training and evaluation conditions, influenced by both the data distribution and the accuracy of the pseudo-labels. In this work, we propose an outlier-robust self-training method based on graduated non-convexity (GNC) to mitigate this problem. We formulate self-training as a non-convex optimization problem over a training set that contains outlier examples, and self-train the models with robust cost functions according to Black-Rangarajan Duality. The algorithm learns slack variables that serve as loss weights for all training samples; the slack variables calibrate the per-example loss terms used to update the model parameters. The calibrated loss terms yield self-trained models that are more robust across training data, evaluation data, and tasks. We conduct experiments on few-shot natural language understanding tasks with labeled and unlabeled data examples. Experimental results show that the proposed loss calibration method improves the performance and stability of self-training on different tasks, strengthening robustness against incorrect pseudo-labels, imbalanced training data, overfitting, and adversarial evaluation data.

1. INTRODUCTION

Recent developments in large-scale pretrained language models have significantly improved the performance of natural language understanding tasks (Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020; He et al., 2020; Brown et al., 2020). After pretraining, these models are typically fine-tuned on task-specific training data with human-generated labels. However, human-generated labels are not available (or not plentiful enough) for all tasks of interest. When a significant number of unlabeled examples is available, a pretrained model can utilize techniques such as self-training to improve performance (He et al., 2019; Zoph et al., 2020). An issue with this approach is that the generated pseudo-labels can be noisy (Zhao et al., 2021; Lang et al., 2022; Zhang & Zhou, 2011), since a pretrained model can make wrong predictions on unseen examples. We propose a learning-based loss calibration strategy that tunes the loss weight of each data example during self-training. In this approach, a subset of the training data is assigned low weights in the overall training loss, and therefore has less influence on parameter updates. To learn the loss weights, we employ the graduated non-convexity (GNC) strategy (Yang et al., 2020) based on Black-Rangarajan Duality (Black & Rangarajan, 1996). Under the fully supervised, outlier-free setting, the model parameters are updated to optimize the total cost over all training samples, while under the self-training setting, the training set usually contains outliers with wrong pseudo-labels. We thus optimize a robust cost function L_r = Σ_i ρ[l(ŷ_i, x_i); θ], where l(·) is the selected loss function and ρ(·) is a robust cost function.

Robustness against different training and evaluation data. Differences between training and evaluation corpora lead to significant performance gaps, driven by the accuracy of the pseudo-labels and the difficulty and coverage of the data examples.
We find that the proposed method is robust across different tasks, training corpora, and adversarial evaluation data, outperforming label-free self-training strategies and label-dependent few-shot learning baselines. Throughout, we denote the training set by D_train = {(x_i, y_i) | i ∈ [0, N]}.
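The loss-weighting idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming a cross-entropy loss over pseudo-labels; the function names and the specific weights are ours, not the paper's:

```python
import math

def cross_entropy(probs, label):
    # Negative log-likelihood of the pseudo-label under the model prediction.
    return -math.log(probs[label])

def weighted_total_loss(losses, weights):
    # The learned slack variables act as per-example loss weights: suspected
    # outliers receive low weights and contribute little to parameter updates.
    return sum(w * l for w, l in zip(weights, losses))

# Toy batch of three pseudo-labeled examples; the third pseudo-label is only
# barely preferred by the model and is a likely outlier.
preds = [[0.9, 0.1], [0.8, 0.2], [0.55, 0.45]]
pseudo_labels = [0, 0, 1]
losses = [cross_entropy(p, y) for p, y in zip(preds, pseudo_labels)]

uniform = weighted_total_loss(losses, [1.0, 1.0, 1.0])
calibrated = weighted_total_loss(losses, [1.0, 1.0, 0.2])  # down-weight the outlier
assert calibrated < uniform
```

With uniform weights, the doubtful third example dominates the gradient; down-weighting it makes the total loss reflect the trusted examples instead.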

2. RELATED WORK

Recent pretrained language models are trained with self-supervised learning strategies, including predicting masked words (Devlin et al., 2018; Liu et al., 2019; He et al., 2020), predicting next words (Brown et al., 2020; Raffel et al., 2020; Lewis et al., 2019), predicting the correctness of words (Clark et al., 2020), and representing sentences (Gao et al., 2021; Chuang et al., 2022). Besides pretraining on corpora without additional human-generated labels, textual entailment corpora, including SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), are also used for both sentence-level pretraining (Reimers & Gurevych, 2019) and downstream tasks, for example relation extraction (Obamuyide & Vlachos, 2018; Yin et al., 2019) and fact checking (Thorne & Vlachos, 2018). While most self-supervised learning methods are agnostic to downstream tasks, self-training models learn to handle any downstream task by learning from synthetic data or pseudo-labels that are automatically generated by the same or other models. Among recent research in the context of self-training, back translation (He et al., 2019) and data augmentation methods (Xie et al., 2020; Chen et al., 2020) use synthetic texts as inputs, pseudo-labeling methods (Zoph et al., 2020; Zou et al., 2019) use synthetic labels as training targets, and self-trained question answering methods (Bartolo et al., 2021; Luo et al., 2022) use synthetic questions as inputs and synthetic answers as target outputs.

Recent studies have discussed the efficiency and robustness of language models from different aspects, including different settings of training and evaluation data. Zang et al. (2019); Jin et al. (2020); Bartolo et al. (2020); Wang et al. (2021) indicated that even state-of-the-art language models make mistakes on adversarial evaluation examples. On the other hand, even for language models pretrained on large corpora, performance on downstream tasks might not be satisfying if not enough task-specific training data is available (Schick & Schütze, 2021; Le Scao & Rush, 2021). There is little work that simultaneously deals with the limitations of small training sets and missing human-generated labels.

3. BACKGROUND: GRADUATED NON-CONVEXITY

Graduated non-convexity (GNC) is a popular optimization method in vision (Blake & Zisserman, 1987) and machine learning (Rose, 1998; Mobahi & Fisher, 2015). The core idea is to optimize a robust cost function instead of a standard loss function, as shown in Equation 1. The robust cost function ρ(·) is adjustable through a coefficient µ. When µ is large enough, ρ(l(y_i, x_i)) is approximately convex. During the optimization, the value of µ is decreased at each step, so the optimization problem gradually transitions from the near-convex surrogate back to the original non-convex robust cost.
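The annealing schedule can be sketched concretely. The section above does not fix a particular ρ, so the sketch below uses the GNC Geman-McClure surrogate and its closed-form weight update from Yang et al. (2020) as an illustration; the constant c, the starting µ, and the annealing factor are all our assumptions:

```python
def gnc_gm_weight(residual, mu, c=1.0):
    # Closed-form minimizing weight (slack variable) for the GNC Geman-McClure
    # surrogate under Black-Rangarajan duality (Yang et al., 2020):
    #   w = (mu * c^2 / (r^2 + mu * c^2))^2
    return (mu * c**2 / (residual**2 + mu * c**2)) ** 2

# Start with a large mu, where the surrogate is approximately convex and nearly
# every example gets weight ~1, then anneal mu downward so the surrogate
# approaches the original non-convex robust cost.
mu, anneal = 64.0, 1.4
inlier, outlier = 0.1, 3.0  # a small and a large per-example residual (loss)
for _ in range(10):
    w_in, w_out = gnc_gm_weight(inlier, mu), gnc_gm_weight(outlier, mu)
    mu = max(1.0, mu / anneal)

# After annealing, the inlier keeps a weight near 1 while the outlier's weight
# collapses, removing its influence on the parameter update.
assert w_in > 0.9 and w_out < 0.1
```

Starting near-convex avoids the poor local minima of the raw robust cost; the gradual decrease of µ then sharpens the distinction between inliers and outliers.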

Few applications of GNC have been seen in other domains, and the approach has not been applied in the area of self-training. The difficulty is that, unlike in prediction problems, an optimal self-training loss on the training examples does not guarantee minimal loss on the evaluation sets. In fact, adding noise during self-training can produce better performance, for example with Dropout (He et al., 2019) and confidence regularization (Zou et al., 2019). In this work, we propose GNC self-training, which produces robust performance against incorrect pseudo-labels, imbalanced training data, overfitting, and adversarial evaluation data. The core idea is penalizing outliers with wrong pseudo-labels, as well as over-confident training samples, by assigning them low loss weights. As a result, such training samples contribute less to the model parameter updates than other samples. In this work we make the following contributions.

Applying graduated non-convexity (GNC) in self-training. To the best of our knowledge, this work is the first attempt to apply GNC to self-trained language models. While traditional GNC methods treat high-loss observations as outliers, we notice that in self-training, outliers include both high-loss and over-confident examples. We propose a shifted robust cost function to deal with the over-confident cases, and the loss weights are tuned during self-training.
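To illustrate the intent of a shifted robust cost, the sketch below uses a hypothetical shifted Geman-McClure-style weight in which residuals are measured relative to a small positive shift, so that both high-loss outliers and near-zero-loss, over-confident examples are down-weighted. This is our illustrative stand-in, not the exact cost function proposed in the paper; all constants and names are assumptions:

```python
def shifted_robust_weight(loss, mu=0.5, shift=0.5, c=1.0):
    # Hypothetical shifted weight: the residual is measured relative to
    # `shift`, so high-loss outliers AND near-zero-loss (over-confident)
    # examples both lie away from the shift and receive reduced weights,
    # while ordinary examples near the shift keep a weight close to 1.
    r = loss - shift
    return (mu * c**2 / (r**2 + mu * c**2)) ** 2

w_overconfident = shifted_robust_weight(0.01)  # suspiciously easy example
w_typical = shifted_robust_weight(0.5)         # ordinary training example
w_outlier = shifted_robust_weight(5.0)         # likely wrong pseudo-label
assert w_typical > w_overconfident > w_outlier
```

Centering the penalty on a small positive loss, rather than on zero, is what lets a single weight function penalize both failure modes described above.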

