LEARNING BETTER STRUCTURED REPRESENTATIONS USING LOW-RANK ADAPTIVE LABEL SMOOTHING

Abstract

Training with soft targets instead of hard targets has been shown to improve the performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where the one-hot encoding of a class is smoothed with a uniform distribution. Owing to its simplicity, label smoothing has found widespread use for training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justification for label smoothing, we obtain PAC-Bayesian generalization bounds for label smoothing and show that the generalization error depends on the choice of the noise (smoothing) distribution. We then propose low-rank adaptive label smoothing (LORAS): a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks. Specifically, we evaluate our method on semantic parsing tasks and show that training with appropriately smoothed soft targets can significantly improve accuracy and model calibration, especially in low-resource settings. Used in conjunction with pre-trained sequence-to-sequence models, our method achieves state-of-the-art performance on four semantic parsing datasets. LORAS can be used with any model, improves performance and implicit model calibration without increasing the number of model parameters, and can be scaled to problems with large label spaces containing tens of thousands of labels.

1. INTRODUCTION

Ever since Szegedy et al. (2016) introduced label smoothing as a way to regularize the classification (or output) layer of a deep neural network, it has been used across a wide range of tasks, from image classification (Szegedy et al., 2016) and machine translation (Vaswani et al., 2017) to pre-training for natural language generation (Lewis et al., 2019). Label smoothing works by mixing the one-hot encoding of a class with a uniform distribution and computing the model's loss as the cross-entropy between this smoothed target and the model's estimate of the class probabilities. This prevents the model from becoming too confident in its predictions, since the model is now penalized (by a small amount) even for predicting the correct class on the training data. As a result, label smoothing has been shown to improve generalization across a wide range of tasks (Müller et al., 2019). More recently, Müller et al. (2019) provided important empirical insights into label smoothing by showing that it encourages the representations a neural network learns for different classes to be equidistant from each other.

Yet, label smoothing is overly crude for many tasks where there is structure in the label space. For instance, consider task-oriented semantic parsing, where the goal is to predict a parse tree of intents, slots, and slot values given a natural language utterance. The label space comprises ontology tokens (intents and slots) and natural language tokens, and the output has specific structure: e.g., the first token is always a top-level intent (see Figure 1), the leaf nodes are always natural language tokens, and so on. Therefore, a well-trained model is more likely to confuse a top-level intent with another top-level intent than with a natural language token. This calls for models whose uncertainty is spread over related tokens rather than over obviously unrelated ones.
This is especially important in the few-shot setting where there are few labelled examples to learn representations of novel tokens from.
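To make the label smoothing mechanics above concrete, the following is a minimal NumPy sketch of standard (uniform) label smoothing and the resulting cross-entropy loss. This illustrates the baseline being generalized, not the LORAS method itself, and the function names are ours:

```python
import numpy as np

def smoothed_targets(labels, num_classes, alpha=0.1):
    """Mix one-hot targets with a uniform distribution:
    q_k = (1 - alpha) * 1[k == y] + alpha / K."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

def cross_entropy(logits, targets):
    """Cross-entropy between soft targets and the model's softmax output."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1).mean()
```

With `alpha > 0`, the smoothed target assigns probability `alpha / K` to every incorrect class, so even a model that predicts the correct class with probability one incurs a nonzero loss; this is the mild penalty on overconfidence discussed above. Uniform smoothing treats all incorrect classes identically, which is exactly the crudeness the structured-label-space argument objects to.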

