LEARNING BETTER STRUCTURED REPRESENTATIONS USING LOW-RANK ADAPTIVE LABEL SMOOTHING

Abstract

Training with soft targets instead of hard targets has been shown to improve the performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where the one-hot encoding of a class is smoothed with a uniform distribution. Owing to its simplicity, label smoothing has found widespread use in training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justifications for label smoothing, we obtain PAC-Bayesian generalization bounds for label smoothing and show that the generalization error depends on the choice of the noise (smoothing) distribution. We then propose low-rank adaptive label smoothing (LORAS): a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks. Specifically, we evaluate our method on semantic parsing tasks and show that training with appropriately smoothed soft targets can significantly improve accuracy and model calibration, especially in low-resource settings. Used in conjunction with pre-trained sequence-to-sequence models, our method achieves state-of-the-art performance on four semantic parsing data sets. LORAS can be used with any model, improves performance and implicit model calibration without increasing the number of model parameters, and scales to problems with label spaces containing tens of thousands of labels.

1. INTRODUCTION

Ever since Szegedy et al. (2016) introduced label smoothing as a way to regularize the classification (or output) layer of a deep neural network, it has been used across a wide range of tasks, from image classification (Szegedy et al., 2016) and machine translation (Vaswani et al., 2017) to pre-training for natural language generation (Lewis et al., 2019). Label smoothing works by mixing the one-hot encoding of a class with a uniform distribution and then computing the cross-entropy between the resulting soft target and the model's estimate of the class probabilities. This prevents the model from becoming too confident in its predictions, since the model is now penalized (by a small amount) even for predicting the correct class on the training data. As a result, label smoothing has been shown to improve generalization across a wide range of tasks (Müller et al., 2019). More recently, Müller et al. (2019) provided important empirical insights into label smoothing by showing that it encourages the representations a neural network learns for different classes to be equidistant from each other.

Yet, label smoothing is overly crude for many tasks where there is structure in the label space. For instance, consider task-oriented semantic parsing, where the goal is to predict a parse tree of intents, slots, and slot values given a natural language utterance. The label space comprises ontology tokens (intents and slots) and natural language tokens, and the output has specific structure: e.g., the first token is always a top-level intent (see Figure 1), the leaf nodes are always natural language tokens, and so on. Therefore, a well-trained model is more likely to confuse a top-level intent with another top-level intent than with a natural language token. This calls for models whose uncertainty is spread over related tokens rather than over obviously unrelated ones.
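The mixing and cross-entropy computation described above can be sketched in a few lines of NumPy. This is a minimal, framework-agnostic illustration of vanilla label smoothing, not the paper's implementation; the names `smooth_targets`, `eps` (the smoothing weight), and the batch layout are our own choices.

```python
import numpy as np

def smooth_targets(labels, num_classes, eps=0.1):
    """Vanilla label smoothing: mix one-hot encodings with a uniform distribution."""
    one_hot = np.eye(num_classes)[labels]               # (batch, K)
    return (1.0 - eps) * one_hot + eps / num_classes    # soft targets, rows sum to 1

def smoothed_cross_entropy(logits, labels, eps=0.1):
    """Cross-entropy between the smoothed targets and the model's predicted distribution."""
    targets = smooth_targets(labels, logits.shape[-1], eps)
    shifted = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())
```

Note that with `eps > 0` even a confident, correct prediction incurs a nonzero loss, which is exactly the regularizing effect discussed above; with `eps = 0` the loss reduces to the standard negative log-likelihood.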
This is especially important in the few-shot setting, where there are few labelled examples from which to learn representations of novel tokens.

Our contributions. We present the first rigorous theoretical analysis of label smoothing by obtaining PAC-Bayesian generalization bounds for a closely related (upper-bound) loss function. Our analysis reveals that the choice of the smoothing distribution affects generalization, and it provides a recipe for tuning the smoothing parameter. We then develop a simple yet effective extension of label smoothing: low-rank adaptive label smoothing (LORAS), which provably generalizes the former and adapts to the latent structure that is often present in the label space of many structured prediction problems. We evaluate LORAS on three semantic parsing data sets and a semantic-parsing-based question-answering data set, using various pre-trained representations such as RoBERTa (Liu et al., 2019) and BART (Lewis et al., 2019). On ATIS (Price, 1990) and SNIPS (Coucke et al., 2018), LORAS achieves an average absolute improvement of 0.6% and 0.9%, respectively, in exact match of the logical form over vanilla label smoothing across different pre-trained representations. In the few-shot setting on the TOPv2 data set (Chen et al., 2020)¹, LORAS achieves an accuracy of 74.1% on average over the two target domains, an absolute improvement of 2% over vanilla label smoothing, matching the state-of-the-art performance of Chen et al. (2020) despite their use of a much more complex meta-learning method. Lastly, in the transfer learning setting on the Overnight data set (Wang et al., 2015), LORAS improves over vanilla label smoothing by 1% on average on the target domains. Furthermore, LORAS is easy to implement and train, and it can be used in conjunction with any architecture.
We show that, unlike vanilla label smoothing, LORAS effectively mitigates the neural network overconfidence problem for structured outputs: it produces better-calibrated uncertainty estimates over the different parts of the structured output. As a result, LORAS reduces the test-set expected calibration error by 55% over vanilla label smoothing on the TOPv2 data set. We also present an efficient formulation of LORAS that does not increase the model size and requires only O(K) additional memory during training, where K is the output vocabulary size (or the number of classes in the multi-class setting).
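For reference, expected calibration error (ECE) is commonly computed by binning predictions by confidence and comparing per-bin accuracy to per-bin confidence. The sketch below shows the standard binned formulation only as background; the paper's exact evaluation protocol for structured outputs may differ, and the function name and bin count are our own choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: |accuracy - confidence| per bin, weighted by bin mass.

    confidences: array of predicted max-probabilities in (0, 1].
    correct: boolean array, True where the prediction was right.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)   # half-open bins (lo, hi]
        if mask.any():
            acc = correct[mask].mean()
            conf = confidences[mask].mean()
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated model (e.g., always fully confident and always correct) yields an ECE of 0, while a fully confident but always-wrong model yields an ECE of 1.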

2. PRELIMINARIES

We consider structured prediction formulated as a sequence-to-sequence (seq2seq) prediction problem. We motivate our method through semantic parsing, where the input $x$ is a natural language utterance and the output $y$ is a serialized tree that captures the semantics of the input in a machine-understandable form (see Figure 1 for an example). Specifically, given input-output pairs $(x, y)$, where $x = (x_i)_{i=1}^{m}$ and $y = (y_i)_{i=1}^{n}$ are sequences, let $\phi(x, y_{1:t-1})$ be the representation of the input and of the output tokens up to time step $t-1$, modeled by a neural network. At time step $t$, the probability of the $t$-th output token is given by $\mathrm{softmax}(W\phi(x, y_{1:t-1}))$, where $W \in \mathbb{R}^{K \times d}$ are the output projection weights (last layer) of the neural network and $K$ is the vocabulary size. The representation and the output projections are learned by minimizing the negative log-likelihood of the observed samples $S$.
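The per-step prediction and sequence negative log-likelihood just described can be sketched as follows. This is a hypothetical NumPy illustration, not the paper's code: `phis` stands in for the learned representations $\phi(x, y_{1:t-1})$ at each time step, and `W` for the output projection.

```python
import numpy as np

def next_token_distribution(W, phi):
    """Probability of the t-th output token: softmax(W @ phi(x, y_{1:t-1}))."""
    logits = W @ phi                      # (K,), K = vocabulary size
    z = np.exp(logits - logits.max())     # shift for numerical stability
    return z / z.sum()

def sequence_nll(W, phis, targets):
    """Negative log-likelihood of an observed output sequence, summed over time steps."""
    nll = 0.0
    for phi, y_t in zip(phis, targets):
        p = next_token_distribution(W, phi)
        nll -= np.log(p[y_t])
    return nll
```

In practice the representations would come from a neural encoder-decoder and the loss would be minimized by gradient descent over both $\phi$ and $W$; the sketch only makes the factorization over time steps explicit.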



¹The TOPv2 data set is a newer version of the TOP data set introduced by Gupta et al. (2018), containing 6 additional domains, which makes it particularly suitable for benchmarking few-shot semantic parsing methods.



Figure 1: Top: Semantic parse tree of the utterance "Driving directions to the Eagles game". Bottom: Serialized tree. IN: stands for intents and SL: stands for slots (see Gupta et al., 2018).

