LEARNING BETTER STRUCTURED REPRESENTATIONS USING LOW-RANK ADAPTIVE LABEL SMOOTHING

Abstract

Training with soft targets instead of hard targets has been shown to improve the performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where the one-hot encoding of a class is smoothed with a uniform distribution. Owing to its simplicity, label smoothing has found widespread use in training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justifications for label smoothing, we obtain PAC-Bayesian generalization bounds for label smoothing and show that the generalization error depends on the choice of the noise (smoothing) distribution. We then propose low-rank adaptive label smoothing (LORAS): a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks. Specifically, we evaluate our method on semantic parsing tasks and show that training with appropriately smoothed soft targets can significantly improve accuracy and model calibration, especially in low-resource settings. Used in conjunction with pre-trained sequence-to-sequence models, our method achieves state-of-the-art performance on four semantic parsing data sets. LORAS can be used with any model, improves performance and implicit model calibration without increasing the number of model parameters, and scales to problems with large label spaces containing tens of thousands of labels.

1. INTRODUCTION

Ever since Szegedy et al. (2016) introduced label smoothing as a way to regularize the classification (or output) layer of a deep neural network, it has been used across a wide range of tasks, from image classification (Szegedy et al., 2016) and machine translation (Vaswani et al., 2017) to pre-training for natural language generation (Lewis et al., 2019). Label smoothing works by mixing the one-hot encoding of a class with a uniform distribution and then computing the cross-entropy with respect to the model's estimate of the class probabilities to obtain the loss. This prevents the model from becoming too confident about its predictions, since the model is now penalized (by a small amount) even for predicting the correct class on the training data. As a result, label smoothing has been shown to improve generalization across a wide range of tasks (Müller et al., 2019). More recently, Müller et al. (2019) provided important empirical insights into label smoothing by showing that it encourages the representations learned by a neural network for different classes to be equidistant from each other. Yet, label smoothing is overly crude for many tasks where there is structure in the label space. For instance, consider task-oriented semantic parsing, where the goal is to predict a parse tree of intents, slots, and slot values given a natural language utterance. The label space comprises ontology (intent and slot) tokens and natural language tokens, and the output has specific structure: e.g., the first token is always a top-level intent (see Figure 1), the leaf nodes are always natural language tokens, and so on. Therefore, a well-trained model is more likely to confuse a top-level intent with another top-level intent than with a natural language token. This calls for models whose uncertainty is spread over related tokens rather than over obviously unrelated tokens.
This is especially important in the few-shot setting, where there are few labelled examples from which to learn representations of novel tokens.

Figure 1: Top: Semantic parse tree of the utterance "Driving directions to the Eagles game". Bottom: Serialized tree. IN: stands for intents while SL: stands for slots (see Gupta et al., 2018).

Our contributions. We present the first rigorous theoretical analysis of label smoothing by obtaining PAC-Bayesian generalization bounds for a closely related (upper-bound) loss function. Our analysis reveals that the choice of the smoothing distribution affects generalization, and provides a recipe for tuning the smoothing parameter. We then develop a simple yet effective extension of label smoothing: low-rank adaptive label smoothing (LORAS), which provably generalizes the former and adapts to the latent structure that is often present in the label space of structured prediction problems. We evaluate LORAS on three semantic parsing data sets, and a semantic-parsing-based question-answering data set, using various pre-trained representations such as RoBERTa (Liu et al., 2019) and BART (Lewis et al., 2019). On ATIS (Price, 1990) and SNIPS (Coucke et al., 2018), LORAS achieves average absolute improvements of 0.6% and 0.9%, respectively, in exact match of logical form over vanilla label smoothing across different pre-trained representations. In the few-shot setting on the TOPv2 data set (Chen et al., 2020), LORAS achieves an accuracy of 74.1% on average over the two target domains, an absolute improvement of 2% over vanilla label smoothing, matching the state-of-the-art performance of Chen et al. (2020) despite their use of a much more complex meta-learning method. Lastly, in the transfer learning setting on the Overnight data set (Wang et al., 2015), LORAS improves over vanilla label smoothing by 1% on average on the target domains.
Furthermore, LORAS is easy to implement and train and can be used in conjunction with any architecture. We show that, unlike vanilla label smoothing, LORAS effectively mitigates the neural network overconfidence problem for structured outputs, producing better-calibrated uncertainty estimates over different parts of the structured output. As a result, LORAS reduces the test set expected calibration error by 55% over vanilla label smoothing on the TOPv2 data set. We present an efficient formulation of LORAS which does not increase the model size, requiring only $O(K)$ additional memory during training, where $K$ is the output vocabulary size (or the number of classes in the multi-class setting).

2. PRELIMINARIES

We consider structured prediction formulated as a sequence-to-sequence (seq2seq) prediction problem. We motivate our method through semantic parsing, where the input $x$ is a natural language utterance and the output $y$ is a serialized tree that captures the semantics of the input in a machine-understandable form (see Figure 1 for an example). Specifically, given input-output pairs $(x, y)$ where $x = (x_i)_{i=1}^m$ and $y = (y_i)_{i=1}^n$ are sequences, let $\phi(x, y_{1:t-1})$ be the representation of the input and output tokens up to time step $t-1$ modeled by a neural network. At time step $t$ the probability of the $t$-th output token is given by $\mathrm{softmax}(W\phi(x, y_{1:t-1}))$, where $W \in \mathbb{R}^{K \times d}$ are the output projection weights (last layer) of the neural network and $K$ is the vocabulary size. The representation and the output projections are learned by minimizing the negative log-likelihood of the observed samples $S$.

Label Smoothing. The idea behind label smoothing is to uniformly smooth the one-hot vector before computing the cross-entropy with the learned distribution. Let $\mathbf{y}_t = (\mathbb{1}[y_t = j])_{j=1}^K$ denote the one-hot encoding of the $t$-th output token and $\mathbf{p}_t = (p(y_t = j \mid x, y_{1:t-1}))_{j=1}^K$ denote the distribution over the vocabulary modeled by the neural network. Then, setting $\mathbf{y}_t^{\mathrm{LS}} = (1-\alpha)\mathbf{y}_t + \alpha \mathbf{1}/K$, we compute $H(\mathbf{y}_t^{\mathrm{LS}}, \mathbf{p}_t)$, the cross-entropy between $\mathbf{y}_t^{\mathrm{LS}}$ and $\mathbf{p}_t$, as our loss function:
$$H(\mathbf{y}_t^{\mathrm{LS}}, \mathbf{p}_t) = -(1-\alpha)\sum_{j=1}^K y_{t,j}\log p_{t,j} - \frac{\alpha}{K}\sum_{j=1}^K \log p_{t,j}. \quad (1)$$
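As a concrete illustration, the smoothed cross-entropy above can be sketched in a few lines of NumPy. This is a minimal single-time-step sketch, not the authors' implementation; the function name and shapes are our own:

```python
import numpy as np

def label_smoothing_loss(logits, target, alpha=0.1):
    """Cross-entropy against uniformly smoothed targets, eq. (1).

    logits: (K,) unnormalized class scores W @ phi(x, y_{1:t-1}).
    target: int index of the true class/token y_t.
    alpha:  weight of the uniform distribution mixed into the one-hot target.
    """
    K = logits.shape[0]
    # Numerically stable softmax -> model distribution p_t.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    # Smoothed target: (1 - alpha) * one-hot + alpha * uniform.
    y_ls = np.full(K, alpha / K)
    y_ls[target] += 1.0 - alpha
    # H(y_ls, p) = -sum_j y_ls[j] * log p[j].
    return -np.sum(y_ls * np.log(p))
```

Note that with `alpha=0` this reduces to the standard negative log-likelihood, and any `alpha > 0` penalizes the model even when it puts high probability on the correct class.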

3. THEORETICAL MOTIVATION FOR LABEL SMOOTHING

In this section we look at why training neural networks with soft targets can help with generalization. To simplify exposition we consider a multi-class classification setting with input-output pairs $(x, y)$, where $y \in [K]$. As first described by Müller et al. (2019), label smoothing encourages the representation of the input $\phi(x)$ to be close to the projection weight $w$ of the correct class while being equidistant from the weights of all other classes. We formalize this by obtaining rigorous generalization bounds for label smoothing. Towards that end, we fix the input representation $\phi(x) \in \mathbb{R}^d$ with $\|\phi(x)\|_2 \le 1$ and focus on the classification layer weights $W \in \mathbb{R}^{K \times d}$. For a noise distribution $n(\cdot)$, which is uniform for standard label smoothing, an upper bound on the loss (1) is given as follows:
$$L(S, W) = \sum_{(x,y)\in S} l(x, y; W, \alpha) + \frac{\alpha}{2}\|\mathbf{n} - W\bar\phi\|_2^2, \quad (2)$$
where $\mathbf{n} = (n(i))_{i=1}^K$ is the vectorized noise distribution, $\bar\phi = \sum_{x\in S}\phi(x)$ is the sum of input representations in the training set, and $l(x, y; W, \alpha)$ is the re-scaled negative log-likelihood of the observed data $S$ in which the linear term is scaled by $\alpha$. The upper bound is obtained by dropping the norm of $W\bar\phi$ from the objective (see Appendix A.1 for a derivation). The objective in (2) is essentially a penalized negative log-likelihood whose penalty term encourages the aggregated (un-normalized) class scores to be close to the noise distribution. Unlike standard weight regularization, however, the regularization term depends on both the weights $W$ and the inputs $x \in S$. Therefore, existing theory for regularized loss minimization does not apply to this case, and we invoke PAC-Bayesian theory to analyze the above. As is standard in PAC-Bayesian analysis, we consider minimizing the empirical loss $L(S, W)$ in a small neighborhood of the weights $W$.
For a posterior distribution $Q$ on the weights $W$, which depends on the sample $S$, we consider the following empirical and expected risks and their respective minimizers:
$$\hat R(Q(W)) = \mathbb{E}_{W' \sim Q(W)}\, L(S, W') \quad \text{(empirical risk)}$$
$$R(Q(W)) = \mathbb{E}_{W' \sim Q(W)}\, \mathbb{E}_{(x,y)}\big[l(x, y; W', \alpha)\big] \quad \text{(expected risk)}$$
$$\hat W \in \operatorname*{argmin}_{W \in \mathbb{R}^{K\times d}} \hat R(Q(W)) \quad \text{(empirical minimizer)}$$
$$W^* \in \operatorname*{argmin}_{W \in \mathbb{R}^{K\times d}} R(Q(W)) \quad \text{(true minimizer)}$$
The following theorem, whose proof we defer to Appendix A.2, bounds the risk of the minimizer of the label smoothing loss (on the sample $S$) in terms of the risk of the minimizer of the expected negative log-likelihood (under the data distribution).

Theorem 1 (PAC-Bayesian generalization bound). Set the distribution $Q(W')$, parameterized by $W$ with bounded induced norm, over the weights $W'$ to be such that each column $W'_{*,i}$ is sampled i.i.d. from the Gaussian distribution $\mathcal{N}(W\bar\phi, I)$. If $\alpha = 2d/\sqrt{N}$, where $N = |S|$ is the number of samples, then with probability at least $1-\delta$ the generalization error is bounded as follows:
$$R(Q(\hat W)) - R(Q(W^*)) \le \frac{2d}{\sqrt{N}}\,\|\mathbf{n} - W^*\bar\phi\|_2^2 + \frac{1}{\sqrt{N}}\log\frac{2e^{b^2/8}}{\delta}.$$
It is important to note that the generalization error depends on the number of classes $K$ through the term $\|\mathbf{n} - W^*\bar\phi\|_2^2$, which grows as $\Theta(K)$. This is because the label smoothing objective regularizes the class scores $W\bar\phi$ rather than the output layer weights $W$ directly; the latter would make the generalization error depend on $\|W\|_F^2$, which is $\Theta(dK)$. The above result also prescribes how to set the smoothing constant $\alpha$, which is typically fixed at 0.1 in practice: as the number of samples $N \to \infty$, $\alpha \to 0$, and less smoothing of the hard targets is needed to achieve generalization. Furthermore, the generalization error depends on how close the aggregated un-normalized class scores $W^*\bar\phi$ of the true minimizer on the training set $S$ are to the noise distribution.
Therefore, choosing a more informative smoothing distribution, as opposed to a uniform distribution, should lead to better generalization.

4. LOW-RANK ADAPTIVE LABEL SMOOTHING

Motivated by the above result, we propose to use a more informative noise distribution than the uniform distribution to smooth the hard targets. The idea behind low-rank adaptive label smoothing (LORAS) is to learn the noise distribution, which is mixed in with the one-hot target, jointly with the model parameters. Specifically, we consider noise distributions $n(\cdot \mid y, S)$ parameterized by the true label $y$ and a symmetric matrix $S \in \mathbb{R}^{K\times K}$. In LORAS we set $\mathbf{y}_t^{\mathrm{LORAS}} = (1-\alpha)\mathbf{y}_t + \alpha\, n(y_t, S)$, where $n(y_t, S) = \mathrm{softmax}(\mathbf{y}_t^\top S)$, and compute the cross-entropy as follows:
$$H(\mathbf{y}_t^{\mathrm{LORAS}}, \mathbf{p}_t) = -\sum_{j=1}^K y^{\mathrm{LORAS}}_{t,j}\log p_{t,j} = -(1-\alpha)\sum_{j=1}^K y_{t,j}\log p_{t,j} - \alpha\sum_{j=1}^K \frac{\exp(S_{y_t,j})}{Z(S_{y_t,*})}\log p_{t,j}, \quad (3)$$
where $S_{y_t,*}$ denotes the $y_t$-th row of the matrix $S$ and $Z(S_{y_t,*}) = \sum_{l=1}^K \exp(S_{y_t,l})$ is the partition function. Thinking of $n_j$ for all $j \in [K]$ as Lagrange multipliers, the matrix $S$ serves to relax the constraint that the representation of an output token be equally close to the weight vectors of all other tokens. For instance, in the semantic parsing task, where the vocabulary comprises ontology tokens (intents and slots) and utterance tokens, it is unreasonable to treat everything on an equal footing, since certain groups of tokens (e.g. ontology tokens) might be more similar to each other than others. For the $i$-th output token, $S_{i,j}$ determines how close the representation of the $i$-th token should be to the weight $w_j$ of the $j$-th token, with larger (resp. smaller) values of $S_{i,j}$ forcing the representation to be closer (resp. farther). With this interpretation, the matrix $S$ can also be thought of as imposing a notion of similarity over vocabulary tokens, with $S_{i,j}$ determining how similar the $i$-th token is to the $j$-th token.

Low-rank assumption. The vocabulary in many NLP applications is large, often containing tens of thousands of tokens. Since the size of $S$ is quadratic in the size of the vocabulary, it is prohibitive to work with the matrix $S$ directly.
Instead, we assume that $S$ has low-rank structure, i.e., there exists an $L \in \mathbb{R}^{K\times r}$ with $r \ll K$ such that $S = LL^\top$. This is a reasonable assumption in many settings, especially in our semantic parsing application, where a natural group structure exists in the vocabulary comprising ontology and utterance tokens. Next, we formulate (3) in terms of $L$, thereby eschewing the matrix $S$ altogether. Under the above assumption we have $\mathbf{y}_t^{\mathrm{LORAS}} = (1-\alpha)\mathbf{y}_t + \alpha\,\mathrm{softmax}(L_{y_t,*}L^\top)$. Note that $L_{y_t,*}L^\top \in \mathbb{R}^K$, so we never need to explicitly construct the matrix $S$, which can be large.

Entropy constraint and dropout. To encourage the discovery of latent structure in the label space, we minimize $H(\mathbf{y}_t^{\mathrm{LORAS}}, \mathbf{p}_t)$ subject to an entropy constraint on $\mathrm{softmax}(S_{k,*})$. This forces the noise distribution to be farther away from the uniform distribution. Also note that the low-rank assumption ensures that $S$ does not become close to a diagonal matrix, which would reduce adaptive label smoothing to training with hard targets. To see this, assume that for some $i$, $S_{i,i} \gg S_{i,j}$ for all $j \ne i$. Then $\mathrm{softmax}(S_{i,*}) \approx e_i$, where $e_i$ is the vector of all zeros except for a 1 at the $i$-th index, thereby reducing adaptive label smoothing to no label smoothing. The low-rank assumption ensures that the matrix $S$ does not become diagonally dominant. Lastly, to prevent the matrix $L$ from overfitting to the training data, we apply dropout to $L$.
The final objective that adaptive label smoothing optimizes is given below, where $H(\cdot, \cdot)$ denotes cross-entropy and $H(\cdot)$ denotes entropy:
$$\tilde L = \mathrm{dropout}(L; q)$$
$$\mathcal{L}_{\mathrm{LORAS}}(S) = \sum_{(x,y)\in S}\sum_{t=1}^{|y|} H(\mathbf{y}_t^{\mathrm{LORAS}}, \mathbf{p}_t) + \eta\, H\big(\mathrm{softmax}(\tilde L_{y_t,*}\tilde L^\top)\big)$$
$$= \sum_{(x,y)\in S}\sum_{t=1}^{|y|} H(\mathbf{y}_t^{\mathrm{LORAS}}, \mathbf{p}_t) - \eta\,(\tilde L_{y_t,*}\tilde L^\top)^\top \mathrm{softmax}(\tilde L_{y_t,*}\tilde L^\top) + \eta\,\mathrm{logsumexp}(\tilde L_{y_t,*}\tilde L^\top), \quad (4)$$
where
$$H(\mathbf{y}_t^{\mathrm{LORAS}}, \mathbf{p}_t) = -(1-\alpha)\sum_{j=1}^K y_{t,j}\log p_{t,j} - \alpha\sum_{j=1}^K \frac{\exp\big(\tilde L_{y_t,*}(\tilde L_{j,*})^\top\big)}{Z(\tilde L_{y_t,*}\tilde L^\top)}\log p_{t,j},$$
$\eta \ge 0$ is a hyper-parameter that controls how far the noise distributions are from the uniform distribution, with larger values encouraging more peaked (low-entropy) noise distributions, and $q \in [0, 1)$ is the dropout parameter. The matrix $L$ is initialized to all ones, and we learn $L$ jointly with the model parameters. After training, the matrix $L$ is discarded. Next, we show that the above formulation of adaptive label smoothing strictly generalizes label smoothing.

Proposition 1. Setting $q = \eta = 0$ and $L = \mathbf{1}$, where $\mathbf{1}$ is the $K$-dimensional vector of ones, gives $\mathcal{L}_{\mathrm{LORAS}}(S) = \mathcal{L}_{\mathrm{LS}}(S) = \sum_{(x,y)\in S}\sum_{t=1}^{|y|} H(\mathbf{y}_t^{\mathrm{LS}}, \mathbf{p}_t)$.

Since, with $q = \eta = 0$ and rank parameter $r = 1$, $L = \mathbf{1}$ lies in the solution path of minimizing the LORAS loss given in (4), adaptive label smoothing strictly generalizes label smoothing.
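The loss above can be sketched for a single time step in NumPy. This is a minimal sketch under our own naming conventions (function signature, dropout handling, and the `train` flag are assumptions, not the authors' code); note that only the $y_t$-th row of $S = LL^\top$ is ever materialized, which is what keeps the additional memory at $O(K)$ per step:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loras_loss(logits, target, L, alpha=0.1, eta=0.1, q=0.5, train=True):
    """Single-time-step LORAS loss, eq. (4): cross-entropy against an
    adaptively smoothed target plus an entropy penalty on the learned
    noise distribution.

    logits : (K,) model scores before softmax (p_t after softmax).
    target : index of the true token y_t.
    L      : (K, r) low-rank factor; S = L @ L.T is never built.
    alpha  : smoothing weight; eta: entropy-penalty weight; q: dropout rate.
    """
    K = logits.shape[0]
    Ld = L
    if train and q > 0:  # dropout on L with inverted scaling
        mask = rng.random(L.shape) >= q
        Ld = L * mask / (1.0 - q)
    # Noise distribution for the target row only: softmax(L[y_t] @ L.T),
    # a K-vector, so memory stays O(K * r) rather than O(K^2).
    scores = Ld[target] @ Ld.T
    n = softmax(scores)
    # Smoothed target and cross-entropy with the model distribution.
    p = softmax(logits)
    y_soft = alpha * n
    y_soft[target] += 1.0 - alpha
    ce = -np.sum(y_soft * np.log(p))
    # Entropy penalty: minimizing eta * H(n) pushes n away from uniform.
    ent = -np.sum(n * np.log(n + 1e-12))
    return ce + eta * ent
```

With `q = eta = 0` and `L` a rank-1 all-ones factor, every row of `scores` is constant, the noise distribution is uniform, and the loss coincides with vanilla label smoothing, as Proposition 1 states.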

5. EXPERIMENTS

We evaluate LORAS on three semantic parsing data sets: ATIS (Price, 1990), SNIPS (Coucke et al., 2018), and TOPv2 (Chen et al., 2020), and a question answering data set, Overnight (Wang et al., 2015), where the goal is to predict an executable logical form that can be executed against a database to answer a natural language query. On TOPv2 we evaluate LORAS in the few-shot setting as in Chen et al. (2020), and for Overnight we consider a transfer learning setting detailed in the next paragraph. All data sets were pre-processed exactly as in Chen et al. (2020). Following Chen et al. (2020), we use the state-of-the-art model from Rongali et al. (2020) as our main model. For ATIS, SNIPS, and TOPv2 we use ontology-only generation, where the model is only allowed to generate ontology tokens while copying everything else from the utterance; for the Overnight data set we place no such constraint on the model, since many of the output tokens are neither part of the ontology nor part of the utterance. We experiment with two different pre-trained language models for initialization: (encoder-only) RoBERTa (Liu et al., 2019) and (encoder and decoder) BART (Lewis et al., 2019). For Overnight we only use BART. We compare the performance of the model with different pre-trained representations under no label smoothing, vanilla label smoothing, and LORAS. For vanilla label smoothing we experiment with $\alpha \in \{0.1, 0.2, 0.3\}$ and report the best accuracy. An $\alpha$ value of 0.1 is typical in the literature, while values above 0.2 produce inferior results. For LORAS, we experiment with $\alpha \in \{0.1, 0.2, 0.3\}$ and a few different rank and dropout parameters, set $\eta = 0.1$ for all experiments, and report the best accuracy. We found that for BART a LORAS dropout parameter of 0.5 worked best, while for RoBERTa a dropout of 0.6 worked best. For all three data sets a rank parameter of 25 worked best for LORAS.
We did not scale beyond rank 25, since the BART model is fairly memory-intensive, leaving little room for large label embedding matrices during training.

Few-shot and transfer-learning setting. In addition to the standard supervised setting, we also evaluate LORAS in the few-shot and transfer learning settings in order to verify the hypothesis that a more informed smoothing distribution helps when training data is limited. In particular, for TOPv2 we follow the same few-shot domain adaptation setting as Chen et al. (2020), where 6 domains in the TOPv2 data set are used as source domains while the remaining 2 (reminder and weather) serve as target domains. For Overnight we consider the four largest domains as source domains and the four smallest domains as target domains. For Overnight, the target

6. RESULTS

We use exact match accuracy, referred to as frame accuracy (FA), where the predicted sequence must match the target sequence exactly, as the primary evaluation metric, following common practice in the literature (Rongali et al., 2020). In addition to FA, we also report semantic accuracy (SA), where we remove the slot values and then compare the resulting target and predicted sequences. For instance, if the target sequence is [in:playmusic [sl:year eighties ] [sl:artist adele ] ] and the predicted sequence is [in:playmusic [sl:year eighties m ] [sl:artist adele ] ], then the frame accuracy is zero while the semantic accuracy is one, since removing the slot values makes the two sequences identical. Since the model often copies over extraneous characters for a slot value, semantic accuracy provides a measure of semantic understanding of the input that is robust to such errors. For Overnight we use logical form exact match accuracy, as was done in Damonte et al. (2019). Even though we motivate LORAS for structured prediction problems, for ATIS and SNIPS, which only contain flat (non-hierarchical) frames, we also evaluate vanilla label smoothing and LORAS on top of a BERT-based joint intent and slot tagging model (Chen et al., 2019). Chen et al. (2019) do not use any label smoothing and report state-of-the-art results for ATIS and SNIPS. For the BERT-based joint intent and slot tagging model we added label smoothing (vanilla and LORAS) only to the slot tagging component, since the number of intents was too small for label smoothing to be useful. We ran each model three times with three different random initializations and report the test set exact match (EM) accuracy for the model that had the best validation set EM performance. Table 1 shows the performance of our method on the ATIS and SNIPS data sets.
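The two metrics can be sketched as follows. The exact token-filtering rule used to strip slot values is not spelled out in the paper, so the sketch below assumes slot values are exactly the tokens that carry no ontology prefix and are not brackets (a hypothetical but natural reading of the example above):

```python
def frame_accuracy(pred: str, target: str) -> int:
    """FA: exact match of the full serialized parse."""
    return int(pred.split() == target.split())

def semantic_accuracy(pred: str, target: str) -> int:
    """SA: exact match after dropping slot values, keeping only
    ontology tokens (in:/sl: prefixed) and brackets."""
    def skeleton(seq: str) -> str:
        keep = [t for t in seq.split()
                if t in ("[", "]") or t.startswith(("[in:", "[sl:", "in:", "sl:"))]
        return " ".join(keep)
    return int(skeleton(pred) == skeleton(target))
```

On the example from the text, the stray token "m" in the predicted slot value makes FA zero while SA remains one.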
LORAS consistently out-performs vanilla label smoothing and training with hard targets (no label smoothing) in almost all cases, while label smoothing sometimes results in poorer performance than hard targets. Since the ATIS data set contains simple, non-hierarchical queries, we observe that with BART the SA of the models is quite close to the FA. On SNIPS we observe that the BART-based models achieve an almost perfect SA score, with LORAS further improving SA by 0.3% over standard LS. We also observe a 1% drop in performance from using standard label smoothing over hard targets. As we show in the subsequent section, uniformly smoothing the labels hurts performance on parts of the logical form for which the model is already fairly accurate. (Results for Chen et al. (2019) are obtained from their paper, Table 3.) Lastly, the slot tagging results show that LORAS helps even in the multi-class classification setting. To further investigate whether the improvements come from LORAS discovering meaningful structure in the label space, and not due to chance, we visualize the noise distributions learned by LORAS on the SNIPS data set and observe that it learns very distinct noise distributions for B and I tags (see Appendix B). Table 2 shows the few-shot performance of LORAS on the TOPv2 data set, and Table 4 shows the results on Overnight. For TOPv2, we also include the state-of-the-art BART+Reptile method (Chen et al., 2020) for comparison. On TOPv2, we see that LORAS comfortably out-performs label smoothing on both target domains. On the reminder domain, which contains logical forms with many hierarchical and nested structures, LORAS improves by 2.5%. Note that using label smoothing hurts performance compared to training with hard (true) targets; Müller et al. (2019) similarly report that label smoothing hurts knowledge distillation performance.
What is surprising is that the performance of our simple LORAS technique matches that of the sophisticated meta-learning method of Chen et al. (2020), which is specifically designed for domain adaptation. For the more complex reminder domain, with many nested structures in the parse tree, we observe that label smoothing degrades performance by as much as 3.1% compared to LORAS. Finally, we also observe that, overall, LORAS provides the highest SA performance among all methods. On Overnight, LORAS improves upon both vanilla label smoothing and no label smoothing by 1% on average, and on the housing domain, which contains the most complex logical forms, LORAS improves by around 6% over the model with no smoothing. Note that our seq2seq model significantly outperforms (average improvement around 20%) the previous state-of-the-art on the Overnight data set (Damonte et al., 2019).

Exploring the structure recovered by LORAS. In this section, we explore the structure in the label space recovered by LORAS on the TOPv2 and Overnight data sets by examining the matrix $S = LL^\top$ learned from data. Since the matrix $S$ is larger than 50k × 50k, we visualize an informative sub-matrix in Figure 2. For each target token $k$, we visualize the noise distribution $\mathrm{softmax}(S_{k,*})$ over a subset of vocabulary tokens. To select the most informative set of target tokens, we select the 10 target tokens with the lowest noise distribution entropy and the 5 target tokens with the highest noise distribution entropy. These noise distributions are shown in the top 10 and bottom 5 rows of Figure 2, respectively. Letting the set $I$ denote these 15 target tokens, Figure 2 visualizes $(\mathrm{softmax}(S))_{I,I}$, where the softmax is taken over the second dimension of $S$. From Figure 2 we immediately observe that the noise distributions with the highest entropy correspond to utterance tokens, and those with the lowest entropy correspond to ontology or special tokens like end-of-sentence (</s>).
While the noise distributions corresponding to utterance target tokens are close to uniform, the noise distributions for ontology tokens place most of their mass on two tokens: ] and </s>. Thus, as expected, LORAS groups the vocabulary into distinct groups: ontology and special tokens, and utterance tokens. Further, among the non-utterance tokens we observe three distinct noise distributions, corresponding to the special tokens ] and </s>, and the rest of the ontology tokens. The noise distributions are also quite informative: for instance, when decoding, the model has to decide between terminating a logical form with a closing bracket or an end-of-sentence tag. Therefore, these two tokens are the most similar, as is reflected in the noise distribution. Due to the rank parameter of 25 used in our experiments, we are not able to discern finer patterns within the ontology tokens. However, it must be noted that learning finer-grained relationships between labels would come at the cost of higher sample complexity. To summarize, the low-rank constraint, coupled with the entropy constraint, helped discover relationships between ontology tokens and special tokens while learning an almost uniform noise distribution over the utterance tokens.

Model calibration with LORAS. In this section we examine model calibration under no label smoothing, standard label smoothing, and LORAS. Intuitively, a probabilistic model is calibrated if the model's posterior probabilities are aligned with the ground-truth correctness likelihood (Guo et al., 2017). We use the framework of Guo et al. (2017) to measure (mis-)calibration of the models using the expected calibration error. Label smoothing has been shown to improve calibration in image classification and machine translation (Müller et al., 2019), and in text classification using pre-trained transformer models (Desai & Durrett, 2020).
We evaluate test set model calibration in the few-shot semantic parsing setting on the TOPv2 data set. Similar to Müller et al. (2019), we evaluate calibration of the conditional probabilities $p(y_t = k \mid x, y_{1:t-1})$ on the test set of the combined target domains, on those examples where the predicted prefix $y_{1:t-1}$ is correct. Figure 3 shows the calibration plots along with the expected calibration error. We see that label smoothing does not improve model calibration in the structured prediction setting, while LORAS produces a fairly well-calibrated model, reducing the expected calibration error by almost half. This is because label smoothing makes the model equally uncertain about all tokens, including top-level intents and closing brackets, among others. As observed in Rongali et al. (2020) and in our experiments, seq2seq models are good at learning the grammar of the parse trees, i.e., they almost always produce well-formatted trees with balanced brackets. So the models should naturally be more confident about the locations of the </s> and ] tokens, while being less confident about novel intents and slots in the target domain. However, by uniformly perturbing the true targets, label smoothing makes the model equally uncertain about all kinds of target tokens, thereby hurting calibration.
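The expected calibration error used above can be sketched as follows, binning predictions by confidence as in Guo et al. (2017); the bin count and function name are our own choices:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and average the
    |accuracy - confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    N = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()       # empirical accuracy in bin
            conf = confidences[in_bin].mean()  # mean confidence in bin
            ece += (in_bin.sum() / N) * abs(acc - conf)
    return ece
```

A model that is always confident and always right has zero ECE, while a model that is 95% confident on predictions that are all wrong has an ECE of 0.95.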

7. RELATED WORK

The main idea behind LORAS is to mix a non-uniform distribution, learned jointly with the model parameters, with the one-hot encoding of the true target to compute smooth labels or soft targets. Mixing non-uniform distributions with the one-hot encoding of the true target has been explored in different contexts using various approaches in the computer vision literature. In the context of addressing overconfident predictions of deep neural networks, Guo et al. (2017) and Pereyra et al. (2017) propose penalizing the model's output distributions using a hinge loss, while Reed et al. (2014) propose mixing the model's predicted class probabilities with the true targets to generate soft targets for training deep networks on noisy labels. Lastly, knowledge distillation (Hinton et al., 2015), which uses the output probabilities of a teacher model to train a student model, is another way in which a model (the student) is trained with smooth labels. However, none of these approaches exploits structure in the label space. As opposed to the aforementioned papers, our approach involves learning a noise distribution parameterized by a low-rank symmetric similarity matrix. We then penalize the entropy of the noise distribution rather than the model's output distribution. As a result, our approach scales to large label spaces with size exceeding fifty thousand. Part of our loss function is similar to that used in Bertinetto et al. (2020); however, Bertinetto et al. (2020) assume a known hierarchical structure on the label space, while we make no such assumption. Our method is also superficially similar to label embedding approaches (Akata et al., 2015), which compute embeddings of labels and learn a mapping or compatibility function between the input representation and the label embedding. Notable work in this area includes Frome et al. (2013), who initialize the label embeddings from a language model trained on data from Wikipedia.
Xian et al. (2016) use different label embeddings such as GloVe and word2vec and learn a bilinear map between input image embeddings and class embeddings for zero-shot classification. In the NLP literature, Wang et al. (2018) compute label embeddings using pre-trained word embeddings of the words in the labels and propose an attention mechanism to compute compatibility between input word embeddings and the label embeddings. However, unlike our method, these label embeddings are part of the model and are needed during inference. Due to this limitation, such label embeddings cannot be readily incorporated into any existing architecture. Furthermore, in structured prediction problems where the label space contains special or abstract tokens, initializing good embedding matrices is not a trivial task.

8. CONCLUSION AND FUTURE WORK

In this paper, we developed a novel extension of label smoothing called low-rank adaptive label smoothing (LORAS), which improves accuracy by automatically adapting to the latent structure in the label space of semantic parsing tasks. While we evaluated LORAS on semantic parsing tasks and a slot tagging task, we believe that it will be useful for other seq2seq and multi-class classification tasks over structured label spaces, and it can always be used in place of standard label smoothing since it strictly generalizes the latter. It would also be interesting to initialize (part of) the embedding matrix used for computing the noise distribution using pre-trained models.

[Figure: heatmap of learned noise distributions over a subset of vocabulary tokens; row labels include Icuisine, UNK, and Iobject_part_of_series_type. Numeric cell values omitted.]

After some minor pre-processing, an example utterance and its corresponding logical form from the data is shown below:

Utterance: meetings that start later than the weekly standup meeting

Logical form: (call:listValue (call:filter (call:getProperty (call:singleton en.meeting ) (string !type ) ) (call:ensureNumericProperty (string start_time ) ) (string > ) (call:ensureNumericEntity (call:getProperty en.meeting.weekly_standup (string end_time ) ) ) ) )

We add function names like (call:listValue, variable types like (string, and the closing bracket ) at the end of the GPT2 vocabulary as ontology tokens. The final vocabulary size was 50284.

Model details. The model from Rongali et al. (2020) comprises a seq2seq Transformer model and a Pointer Generator Network (See et al., 2017). When using RoBERTa, only the encoder is initialized with the pre-trained embeddings, while a 3-layer, 256-hidden-unit decoder is trained from scratch. Unlike RoBERTa, BART provides both a pre-trained encoder and decoder, which makes it an ideal option for initializing seq2seq models. Lastly, in our implementation, the model generates ontology tokens from the vocabulary while the slot values are always copied over from the utterance using the copy mechanism. For ATIS and SNIPS we did not canonicalize the data (Chen et al., 2020), i.e. alphabetically order the slots at the same level, during pre-processing.

Training details.
For all our experiments we used the following early stopping criterion: we stop training once the validation accuracy has not improved for 20 epochs. All models were trained on Nvidia Tesla GPUs with 16GB of RAM. For ATIS and SNIPS the models were trained on a single GPU, while for TOPv2 the models were trained on 2 GPUs. We used the Adam optimizer with default settings, an inverse square root learning rate schedule, and a batch size of 32 for all our experiments.
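The schedule and stopping criterion above can be sketched in a few lines of Python. This is an illustrative sketch, not our training code: the `base_lr` and `warmup_steps` values are placeholder assumptions, since only the schedule family and the 20-epoch patience are specified above.

```python
def inv_sqrt_lr(step, base_lr=5e-4, warmup_steps=4000):
    """Inverse square root learning rate schedule with linear warmup.
    base_lr and warmup_steps are illustrative values, not the paper's."""
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup
    return base_lr * (warmup_steps / step) ** 0.5     # inverse-sqrt decay

def should_stop(val_accuracies, patience=20):
    """Early stopping: stop once the best validation accuracy is more than
    `patience` epochs old."""
    if not val_accuracies:
        return False
    best_epoch = max(range(len(val_accuracies)), key=val_accuracies.__getitem__)
    return (len(val_accuracies) - 1 - best_epoch) >= patience
```

Note that after warmup the learning rate decays proportionally to 1/sqrt(step), so quadrupling the step count halves the learning rate.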



The TOPv2 data set is a newer version of the TOP data set introduced in Gupta et al. (2018), containing 6 additional domains, which makes it particularly suitable for benchmarking few-shot semantic parsing methods. Our numbers for the slot tagging model without smoothing are slightly lower than those reported in Chen et al. (2019). We suspect this is because we use a validation set to select the best model initialization rather than directly selecting the model with the highest test set numbers. We use this procedure for all our models.



[in:get_directions [sl:destination [in:get_event [sl:name_event Eagles ] [sl:cat_event game ] ] ] ]
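A logical form in this bracketed TOP format can be decoded into a tree with a short recursive parser. The sketch below is our own illustrative implementation (`parse_top` and its `(label, children)` tuple representation are assumptions, not the paper's code):

```python
def parse_top(s):
    """Parse a TOP-style bracketed logical form, e.g.
    '[in:get_event [sl:name_event Eagles ] ]', into nested
    (label, children) tuples, where children are sub-trees or utterance tokens."""
    tokens = s.replace("[", " [ ").replace("]", " ] ").split()

    def parse(i):
        assert tokens[i] == "["
        label = tokens[i + 1]              # intent (in:) or slot (sl:) label
        children, i = [], i + 2
        while tokens[i] != "]":
            if tokens[i] == "[":
                child, i = parse(i)        # nested intent/slot
            else:
                child, i = tokens[i], i + 1  # natural language token
            children.append(child)
        return (label, children), i + 1

    tree, _ = parse(0)
    return tree
```

For instance, the nested `[in:get_event ...]` fragment above parses into an `in:get_event` node whose children are the `sl:name_event` and `sl:cat_event` sub-trees.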

The target domain training set was about one third the size of the source domain training set. We first train our model on the combined training data from the source domains and then fine-tune the model on the combined training data from the target domains. For TOPv2, the training and validation data are limited to 25 samples per intent and slot (25 SPIS) for each target domain. More details about our experimental setup, data sets, and training can be found in Appendix A.4.

Figure 2: A sub-matrix of the noise distribution matrix learned from the target domains on: (left) the TOPv2 data in the few-shot setting, (right) the Overnight data set in the transfer learning setting. Row labels indicate the target token, and each row shows the noise distribution over a subset of the vocabulary tokens (column labels) given the target token. (Best viewed in color.)


Figure 4: A sub-matrix of the noise distribution matrix learned over the SNIPS data set. The top ten rows show labels with the lowest entropy noise distributions, while the bottom ten rows show labels with the highest entropy noise distributions.

Test set frame accuracy (FA) and semantic accuracy (SA) on the ATIS and SNIPS data sets for our seq2seq models with the copying mechanism. Numbers in bold represent the best performing method for a given metric and pre-trained model. *Refers to the BERT-based joint intent classification and slot tagging model of Chen et al. (2019), where frame accuracy is the sentence-level exact match accuracy.

Logical form exact match accuracy on target domains in the Overnight data set (Wang et al., 2015). The numbers for the neural transition-based parser of Damonte et al. (2019) are shown for comparison.

A APPENDIX

A.1 LABEL SMOOTHING LOSS REFORMULATION

Starting with (1), representing the outputs y as one-hot vectors, and representing the classification layer weights as a matrix W \in \mathbb{R}^{K \times d}, we have the following label smoothing loss for the multi-class classification setting:

L(S, W) = -\sum_{(x,y) \in S} \sum_{k=1}^{K} \left[ (1 - \alpha) y_k + \alpha\, n(k) \right] \log p(k \mid x, W),

where n(\cdot) is the noise distribution, which is uniform for standard label smoothing. Writing the noise distribution as a vector n = (n(i))_{i=1}^{K}, denoting \bar{\phi} = \sum_{x \in S} \phi(x), and with p(y \mid x, W) = \exp(y^\top W \phi(x)) / Z(W, x), where Z(W, x) = \sum_{y} \exp(y^\top W \phi(x)), we have:

L(S, W) = \sum_{(x,y) \in S} \left[ -(1 - \alpha)\, y^\top W \phi(x) + \log Z(W, x) \right] - \alpha\, n^\top W \bar{\phi}.

Denote the re-scaled negative log-likelihood by l(x, y; W, \alpha) = -(1 - \alpha)\, y^\top W \phi(x) + \log Z(W, x), and write -\alpha\, n^\top W \bar{\phi} = \frac{\alpha}{2} \left( \|n - W\bar{\phi}\|_2^2 - \|n\|_2^2 - \|W\bar{\phi}\|_2^2 \right). Since \|n\|_2^2 is a constant term for minimizing L(S, W) with respect to W, dropping it along with the (negative) term -\frac{\alpha}{2}\|W\bar{\phi}\|_2^2, we arrive at the following upper bound on the label smoothing loss:

L(S, W) \leq \sum_{(x,y) \in S} l(x, y; W, \alpha) + \frac{\alpha}{2} \|n - W\bar{\phi}\|_2^2.

We re-state the theorem for convenience.

Theorem 2 (PAC-Bayesian generalization bound). Set the distribution Q(W), parameterized by W with bounded induced norm, over the weights W to be such that each column W_{*,i} is sampled i.i.d. from the Gaussian distribution N(W\bar{\phi}, I). If N = |S| is the number of samples, then with probability at least 1 - \delta the generalization error is bounded as follows:

\mathbb{E}_{(x,y) \sim D}\left[ l(x, y; W, \alpha) \right] \leq \frac{1}{N} l(S; W, \alpha) + b \sqrt{ \frac{ \frac{d}{2} \|W\bar{\phi} - n\|_2^2 + \log\frac{N}{\delta} }{ 2(N - 1) } }.

Proof. Denote l(S; W, \alpha) = \sum_{(x,y) \in S} l(x, y; W, \alpha) as the (re-scaled) negative log-likelihood of the sample S, and let \mathbb{E}_{(x,y) \sim D}[l(x, y; W, \alpha)] be the expected negative log-likelihood. Choose the prior to be the product distribution, parameterized by the noise distribution, in which each column is sampled i.i.d. from N(n, I). From the PAC-Bayesian theorem (McAllester, 2003) we have that, with probability at least 1 - \delta and for all W \in \mathbb{R}^{K \times d} with bounded induced norm:

\mathbb{E}_{(x,y) \sim D}\left[ l(x, y; W, \alpha) \right] \leq \frac{1}{N} l(S; W, \alpha) + b \sqrt{ \frac{ \mathrm{KL}(Q \,\|\, P) + \log\frac{N}{\delta} }{ 2(N - 1) } },

where b is the Lipschitz constant of the loss l, which is bounded under our assumptions that \|\phi(\cdot)\|_2 \leq 1 and W has bounded induced norm. Since Q and P are products of Gaussians with identity covariance, \mathrm{KL}(Q \,\|\, P) = \frac{d}{2} \|W\bar{\phi} - n\|_2^2, which yields the stated bound.

We re-state the proposition for convenience.

Proposition 2. Setting q = \eta = 0, the rank parameter r = 1, and L = 1, where 1 is the K-dimensional vector of ones, the LORAS loss (4) reduces to the standard label smoothing loss.

Proof. With q = \eta = 0 and r = 1, L = 1 is in the solution path of minimizing the LORAS loss given in (4), and the corresponding noise distribution is uniform, which recovers standard label smoothing. Therefore, adaptive label smoothing strictly generalizes label smoothing.

Table 4: Statistics of the Overnight data set (Wang et al., 2015). The top four domains are chosen as source domains while the bottom four are chosen as target domains. Average depth is the average of the maximum depth of logical forms (trees) in the test set.

Overnight data set. Statistics of the Overnight data set (Wang et al., 2015) are shown in Table 4. For our transfer learning setting we chose the four smallest domains as target domains and the rest as source domains. As in prior work (Wang et al., 2015; Damonte et al., 2019), we randomly select 20% of the training data from each domain as the validation set, which is used for model selection and hyperparameter tuning. We used the test set provided by the data set authors (Wang et al., 2015) for reporting the final performance of all the models.
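The smoothed loss and the reduction in Proposition 2 can be checked numerically. The sketch below is illustrative: the row-wise softmax of E @ E.T is a simplified stand-in for the LORAS noise-matrix parameterization, and the vocabulary size is a toy value.

```python
import numpy as np

def smoothed_loss(logits, y, noise_row, alpha=0.1):
    """Cross-entropy against the soft target (1 - alpha) * one_hot(y) + alpha * noise_row."""
    m = logits.max()
    log_p = logits - (np.log(np.exp(logits - m).sum()) + m)   # stable log-softmax
    target = (1 - alpha) * np.eye(len(logits))[y] + alpha * noise_row
    return -(target * log_p).sum()

def loras_noise(E):
    """Row-wise softmax of the low-rank scores E @ E.T; row i plays the role of
    the noise distribution n(. | target token i). Simplified stand-in for the
    parameterization in the paper."""
    s = E @ E.T
    s = s - s.max(axis=1, keepdims=True)
    p = np.exp(s)
    return p / p.sum(axis=1, keepdims=True)

K = 6                                   # illustrative vocabulary size
rng = np.random.default_rng(0)
logits = rng.normal(size=K)

# Proposition 2: with q = eta = 0, rank r = 1, and L = 1 (all ones), every row
# of the noise matrix is uniform, so the LORAS loss coincides with standard
# label smoothing.
N_uniform = loras_noise(np.ones((K, 1)))
ls_loss = smoothed_loss(logits, 2, np.full(K, 1.0 / K))
loras_loss = smoothed_loss(logits, 2, N_uniform[2])
```

With a higher-rank E, each row of the noise matrix becomes a distinct, non-uniform distribution over related tokens, which is exactly the extra freedom LORAS adds over uniform smoothing.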

B STRUCTURE RECOVERED BY LORAS IN THE SNIPS DATA SET

Similar to Figure 2, Figure 4 visualizes the noise distributions of the labels with the lowest entropy (top ten rows) and those with the highest entropy (bottom ten rows). We observe that the ten lowest-entropy noise distributions correspond mostly to "B" tags (beginning of slot), while the ten highest-entropy noise distributions (close to uniform) are learned for "I" tags. This is intuitive, since the model is more likely to confuse one B tag with another B tag (e.g., B-city vs. B-state) than a B tag with an I tag. Therefore, by smoothing over closely related labels (tags), LORAS forces the model to learn better representations.
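The entropy ranking used to select the rows shown in Figure 4 can be sketched as follows (the 3 x 4 noise matrix here is a toy example, not the learned one):

```python
import numpy as np

def rows_by_entropy(noise):
    """Return row indices of a noise distribution matrix sorted by Shannon
    entropy, lowest (most peaked) first."""
    eps = 1e-12                                   # avoid log(0)
    ent = -(noise * np.log(noise + eps)).sum(axis=1)
    return np.argsort(ent)

# Toy noise matrix: row 0 is peaked (low entropy), row 2 is uniform (max entropy).
noise = np.array([[0.97, 0.01, 0.01, 0.01],
                  [0.40, 0.40, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
order = rows_by_entropy(noise)   # lowest-entropy row first
```

Taking the first and last ten indices of `order` for the learned noise matrix gives the top and bottom blocks of rows plotted in Figure 4.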

