GROOT: CORRECTIVE REWARD OPTIMIZATION FOR GENERATIVE SEQUENTIAL LABELING

Abstract

Sequential labeling is a fundamental NLP task, forming the backbone of many applications. Supervised learning of Seq2Seq models has shown great success on these problems. However, the training objectives remain significantly disconnected from the metrics and desiderata we care about in practice. For example, a practical sequence tagging application may want to optimize for a certain precision-recall trade-off (of the top-k predictions), which is quite different from the standard objective of maximizing the likelihood of the gold labeled sequence. To bridge this gap, we propose GROOT, a simple yet effective framework for Generative Reward Optimization Of Text sequences. GROOT works by training a generative sequential labeling model to match the decoder output distribution with that of the (black-box) reward function. Using an iterative training regime, we first generate prediction candidates, then correct errors in them, and finally contrast those candidates (based on their reward values). As demonstrated via extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics. Furthermore, GROOT leads to improvements of the overall decoder distribution, as evidenced by the quality gains of the top-k candidates.

1. INTRODUCTION

Figure 1: Results for our model (GROOT) vs. the NLL baseline, demonstrating the precipitous drop-off in quality of NLL model predictions outside the top-1.

Table 1: Top-5 predictions of the NLL baseline model (with errors bolded, and the perfect prediction marked with (*)).

Sequential labeling tasks are ubiquitous among NLP applications. Tasks ranging from syntactic analysis (e.g., POS tagging and phrase chunking) to semantic analysis (e.g., named entity recognition, slot filling, and query segmentation) are critical components in end-to-end applications, such as search engines and goal-oriented dialog systems. Advances in pretraining of generative language models (LMs) like T5 (Raffel et al., 2020) and mT5 (Xue et al., 2021) have enabled us to use the same training strategy seamlessly across these diverse sequence labeling tasks. We can fine-tune a pretrained LM by maximizing the likelihood of generating the ground-truth (human-annotated) labeled data. However, in practice, the metrics and constraints we may care about remain fairly disconnected from the standard Negative Log-Likelihood (NLL) objective used to train these models. To understand this better, consider an example of an entity recognition model within an e-commerce system. This model would typically be trained on data of the following form:

Input: black & decker blender under 100
Label: [BRAND black & decker] [PRODUCT blender] [PRICE under 100]

While this e-commerce pipeline could utilize the model's predictions in different ways, a likely use is in retrieving candidates that match the predicted annotations. However, with models being imperfect, even well-trained models may make errors like:

Incorrect prediction: [COLOR black] & decker [PRODUCT blender] [PRICE under 100]

Thus such a retrieval use case may require a desired precision-recall balance, since precision errors (i.e., incorrectly annotated spans) could lead to catastrophic failures downstream, perhaps incorrectly filtering only "black" colored blenders in the above example. Unfortunately, current models do not allow us to optimize for or incorporate such complex metrics or trade-offs. Such errors are in fact commonplace, as seen in Table 1 and empirically in our results. The issue is further exacerbated when we go beyond the top-1 prediction, as seen in Figure 1, with a drastic drop-off in the quality of predictions outside the top-1. In addition to identifying this shortcoming of NLL-based models, we make the following contributions in this paper:

• We propose GROOT, a simple yet effective framework for training sequence labeling models to optimize (black-box) reward metrics. GROOT trains a generative sequential labeling model to learn the reward metrics associated with the output space of sequences.
• We propose CML, a new Corrective Margin Loss function, that contrasts candidates with differing reward metrics, enabling the model to better understand the reward space.
• We show that simply relying on candidates sampled from the decoder output distribution, as proposed in prior work for machine translation (Shu et al., 2021), does not work (and often worsens reward scores significantly).
• To enable principled and targeted exploration of the output reward space, we introduce a correction function-based approach, correcting errors in predictions to explore the space.
• Extensive experiments over 4 public datasets demonstrate that GROOT significantly improves rewards over competitive baselines.
• Furthermore, we demonstrate that GROOT learns a better overall decoder distribution, with significant gains in correlation with reward scores.
With extensive experiments aimed at understanding the value of each component of GROOT, we believe our work can help significantly influence practical applications of sequence labeling models.

2.1. GENERATIVE SEQUENTIAL LABELING

Sequential labeling typically involves tagging spans of the input with their corresponding labels. The emergence of Seq2Seq learning (Sutskever et al., 2014) has seen common NLP tasks like POS tagging, chunking, NER, and segmentation cast as Seq2Seq text generation tasks (Vinyals et al., 2015; FitzGerald, 2020; Raffel et al., 2020). More formally, given an input text x, we can generate the sequential text label y as ŷ = argmax_y' p_θ(y'|x), where the output sequence probability p_θ(y'|x) is computed by a Seq2Seq model with parameters θ. Arbitrary text formats are acceptable for x and y, as long as we can interpret the output. For example, Raman et al. (2022) investigated various formats and showed that such a generative approach using pre-trained Seq2Seq models outperforms word-level classification models (e.g., mBERT (Devlin et al., 2018)). To train the Seq2Seq models, we typically use a (human-annotated) ground-truth dataset: a set of training examples D_train, where each example e is a pair of an input text x and its ground-truth annotation y*: e = (x, y*). The most common supervised learning approach (Williams & Zipser, 1989) then optimizes the NLL loss: NLL_θ(e) = -log p_θ(y*|x). While NLL-based training offers many advantages and promising results (Athiwaratkun et al., 2020; Raffel et al., 2020; Yan et al., 2021), it also comes with significant drawbacks. Consider the top-5 predictions of an NLL-trained model (Table 1) on the SNIPS dataset (Coucke et al., 2018). Aside from the clear overfitting issue (commonly observed in these models (Bishop & Nasrabadi, 2006)), we also find that the top-5 predictions contain numerous different errors. While any ML model is bound to make prediction errors in practice, we should recognize that different errors affect downstream applications differently.
For example, some applications may prefer recall errors / false negatives (i.e., unannotated spans) over precision errors / false positives (i.e., incorrectly annotated spans), or vice versa. However, NLL-based training does not provide an easy way to incorporate such desiderata or trade-offs.
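To make the generative casting concrete, a bracketed-span target string like the e-commerce example above can be rendered from token-level span annotations. The sketch below uses our own minimal (start, end, tag) encoding, not necessarily the exact text format studied in the paper:

```python
def encode_tagged(tokens, spans):
    """Render tokens plus labeled (start, end, tag) spans as a bracketed
    target string, e.g. "[BRAND black & decker] [PRODUCT blender]".
    Spans are assumed non-overlapping."""
    out, i = [], 0
    for start, end, tag in sorted(spans):
        out.extend(tokens[i:start])  # emit any untagged prefix tokens
        out.append("[" + tag + " " + " ".join(tokens[start:end]) + "]")
        i = end
    out.extend(tokens[i:])  # trailing untagged tokens
    return " ".join(out)

tokens = ["black", "&", "decker", "blender", "under", "100"]
spans = [(0, 3, "BRAND"), (3, 4, "PRODUCT"), (4, 6, "PRICE")]
print(encode_tagged(tokens, spans))
# → [BRAND black & decker] [PRODUCT blender] [PRICE under 100]
```

A Seq2Seq model is then simply trained to emit such strings, and the inverse parse recovers the predicted spans.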

2.2. REWARD OPTIMIZATION AND CONTRASTIVE LOSS

Recent attempts have been made to more closely correlate the output distribution of text generation models (especially for machine translation) with a desired reward metric (Ranzato et al., 2016; Shen et al., 2016; Edunov et al., 2018; Choshen et al., 2020). Most notably, Shu et al. (2021) investigated an efficient and effective method of reward optimization for large neural machine translation models. They proposed a contrastive loss comparing the best and worst prediction candidates among the top-k predictions, while previous related studies investigated different combinations of candidates to define contrastive losses. However, the underlying assumption is that sampling from the decoder distribution allows for effective exploration (and learning) of the metric space. Unfortunately, this assumption does not hold true in most scenarios, as the output distribution of training examples is highly skewed and often contains low-quality predictions outside the top-1 (see example in Table 1). Hence we need alternative ways to achieve meaningful exploration of the metric space.

3. PROPOSED FRAMEWORK: GROOT

As alluded to above, solely optimizing the NLL loss may not provide sufficient exploration of the output metric space. Given a predefined (black-box) reward function, we want the Seq2Seq model to learn an output distribution that more closely correlates with, and maximizes, the reward values.

3.1. OVERVIEW

To tackle this issue, we propose GROOT (visualized in Figure 2). GROOT works by iteratively training and improving the reward of decoder outputs. More specifically, starting from the NLL-based initial model θ^(1), in each iteration 2 ≤ t ≤ T we:

• (Section 3.2) Generate a training set L̃_t by correcting errors in the top-k predictions of the previous model θ^(t-1). As we show empirically, leveraging this correction function is key to efficiently exploring the output space, and significantly outperforms exploration based solely on current predictions (Shu et al., 2021).
• (Section 3.3) Train a new, improved model θ^(t) on L̃_t by applying our proposed Corrective Margin Loss, which contrasts and orders different candidates based on their rewards.

Henceforth we denote the predefined (black-box) reward as R, where R(y, y*) denotes the reward for an example prediction y (optionally computed using the gold label y*).
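The iterative regime can be sketched at a high level as follows. Here `topk_fn`, `train_fn`, `reward_fn`, and `correct_fn` are placeholder callables of our own; in the real system they would be beam-search inference over the Seq2Seq decoder, the CML/CRL training step of Section 3.3, the reward R, and the correction function of Section 4.2:

```python
def select_candidate(cands, y_gold, reward_fn):
    """Default selection strategy: the highest-reward non-perfect candidate."""
    scored = [(reward_fn(y, y_gold), y) for y in cands if y != y_gold]
    return max(scored)[1] if scored else None

def groot_iterate(topk_fn, train_fn, train_set, reward_fn, correct_fn, T=9):
    """Skeleton of the GROOT loop: per iteration, generate top-k candidates,
    pick a correctable one, correct it, and retrain with the corrective loss."""
    for t in range(2, T + 1):
        batch = []
        for x, y_gold in train_set:
            cands = topk_fn(x)  # beam-search top-k of theta^(t-1)
            y_sel = select_candidate(cands, y_gold, reward_fn)
            if y_sel is None:
                continue  # nothing correctable: drop the example
            y_plus = correct_fn(y_sel, y_gold)  # improved candidate y+
            batch.append((x, y_gold, y_sel, y_plus))
        train_fn(batch, t)  # CML/CRL update producing theta^(t)
```

The skeleton deliberately omits reward caching, checkpoint selection, and the λ-weighted NLL term, all of which the full method includes.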

3.2. DATA GENERATION

For the t-th step of GROOT, we create a new training set L̃_t from a (random) subset of the training data. In particular, for each example e = (x, y*) in this subset, we do the following:

- Get top-k predictions: Using the previous iteration's model θ^(t-1), we perform inference via beam search to generate the top-k output candidates for the example e.
- Select correctable candidate: Using the input reward function R(y, y*), we compute a reward value for each of the top-k candidates y. Based on the rewards, we select ỹ using a fixed criterion (e.g., the non-perfect candidate with the highest reward in the top-k). To maximize learning of the output-reward relation, we only select candidates that can be "corrected", as described below.

Based on prior work (Shu et al., 2021), one may assume that simply comparing the top-k allows for sufficient learning of the output space. However, as we show empirically (in Section 6), using the top-k predictions alone is unreliable and does not allow for sufficient exploration of the output space; at times it may even worsen results (Table 3). Thus, to help us better explore the output space in a guided manner, we introduce a key additional step:

- Correct erroneous candidate: To effectively learn which predictions lead to better reward values, we introduce correction functions to rectify erroneous predictions. More specifically, a correction function C(y, y*) is designed to return a new prediction y+ such that r+ = R(y+, y*) is expected to be greater than (or equal to) r = R(y, y*). In other words, given an imperfect prediction y, and optionally the gold label y*, the correction function aims to find a prediction that improves upon or fixes errors in y.
The goal of the correction function is not to find the best possible prediction y+ (which would just be the gold label y*), but rather to expose the model to novel (improved) prediction candidates, potentially quite different from the existing model's output distribution. Since correction functions will be used to improve predictions and explore the output space, they could be either algorithmic or themselves model-based. Note that in some applications with complex reward functions, we may not be able to easily devise a single correction function that always finds an improved prediction. However, in such cases we can instead combine multiple correction functions, requiring only that the functions together produce a y+ (s.t. r+ ≥ r) with high probability. If we cannot find an improved y+ from any correction function, we can choose to drop the example instead. Once we have this improved ỹ+, each example e is expanded to form a new tuple: ẽ = (x, y*, r*, ỹ, r̃, ỹ+, r̃+), where r* = R(y*, y*), r̃ = R(ỹ, y*), and r̃+ = R(ỹ+, y*).

3.3. TRAINING WITH CORRECTIVE MARGIN LOSS

To help the model best learn the output space of rewards, we train it by contrasting pairs of ỹ and ỹ+. In particular, for each ẽ ∈ L̃_t we use the following corrective margin loss:

CML_θ(x, ỹ, r̃, ỹ+, r̃+) = max(0, m - log p_θ(ỹ+|x) + log p_θ(ỹ|x)),

where m is a margin computed with a hyperparameter α: m = α(r̃+ - r̃). The form of this loss function is motivated by previous work (Edunov et al., 2018; Shu et al., 2021), which compares candidates among the model-generated ones. By contrast, we use CML to teach the model to prefer the corrected ỹ+ over the original ỹ. An astute reader may notice that the above formulation of CML does not directly use the gold label y*. While this raises interesting possibilities of unsupervised variants, for brevity we largely leave this to future work. Instead, we can incorporate y* by also teaching the model to explicitly prefer y* over ỹ+, since we would still like the model to ideally produce y* as the best candidate. Thus we add another CML term to define the following corrective ranking loss:

CRL_θ(ẽ) = CML_θ(x, ỹ, r̃, ỹ+, r̃+) + CML_θ(x, ỹ+, r̃+, y*, r*).

To further prioritize the gold label, we can still include NLL in our training loss:

λ × NLL_θ(e) + (1 - λ) × CRL_θ(ẽ),

where λ ∈ [0, 1] is a hyperparameter controlling the regularization effect of NLL_θ(e). In practice, we found that a small value of λ helped stabilize learning and reduce variance (see Figure 6).
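In scalar form (treating each sequence log-probability as a plain float, as the decoder would produce after summing over tokens), the losses above can be sketched as:

```python
def cml(logp_plus, logp, r_plus, r, alpha=0.5):
    """Corrective margin loss: require log p(y+|x) to exceed log p(y|x)
    by a margin proportional to the reward gap, m = alpha * (r+ - r)."""
    m = alpha * (r_plus - r)
    return max(0.0, m - logp_plus + logp)

def crl(logp, logp_plus, logp_gold, r, r_plus, r_gold, alpha=0.5):
    """Corrective ranking loss: prefer y+ over y, and the gold y* over y+."""
    return (cml(logp_plus, logp, r_plus, r, alpha)
            + cml(logp_gold, logp_plus, r_gold, r_plus, alpha))

def total_loss(nll, crl_value, lam=0.1):
    """Combined objective: lambda * NLL + (1 - lambda) * CRL."""
    return lam * nll + (1.0 - lam) * crl_value
```

Note the hinge: once the corrected candidate is already preferred by at least the margin, the CML term vanishes and only the NLL regularizer continues to shape training.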

4. REWARD AND CORRECTION FUNCTIONS

4.1. REWARD FUNCTION

Practical applications of a generative model may span a variety of complex reward functions which we would like to optimize for directly. Unfortunately, public benchmarks do not come with any such complex reward function. Thus we look to create a realistic reward function for our running example of a sequence tagger / entity recognizer. Our reward function should be realistic enough to reflect the complexities of real-world metrics (and thus more complex than what traditional techniques can optimize), while still being understandable to readers. As motivated in the introduction, practical applications may weight precision and recall differently. With β controlling the importance of precision vs. recall, consider a reward of the form:

R(y, y*) = β × precision(y, y*) + recall(y, y*).

To further challenge the model, let us look into the definition of precision for sequence labeling, and highlight a practical issue that needs addressing. Consider this synthetic example:

- ground truth: y* = [A X] Y Z [B W U] [C V]
- prediction (1): y_1 = [D X] Y [E Z W U] [F V] (a completely imprecise prediction)
- prediction (2): y_2 = X Y Z W U V (an empty prediction)

The typical definition of precision would lead to precision(y_1, y*) = precision(y_2, y*) = 0, since the number of true positives is 0. However, in practice these two predictions would behave very differently. When used to influence search engine behavior, the first prediction would lead to far worse results (due to its false positives), unlike the second prediction. This boils down to the precision metric not differentiating between false positives and empty predictions, a distinction that matters in practical sequential tagging problems.
To more closely reflect this distinction, let us modify the definition of span-level precision to:

precision'(y, y*) = (TP - c × FP) / (TP + FP),

where TP and FP stand for true positives and false positives, respectively, and c (≥ 0) is a hyperparameter to explicitly penalize false positives. When c = 0 we recover the classical precision metric. This small change now allows us to easily distinguish the two predictions y_1 and y_2 above:

precision'(y_1, y*) = (0 - c × 3) / (0 + 3) = -c < precision'(y_2, y*) = 0.

As an illustrative reward, we set (c, β) = (2, 4) for our reward function in the rest of the paper, leading to reward scores between -8 (completely wrong) and +5 (perfect). We would like to reinforce that the above is just one example of a complex but realistic reward function. Experiments with other reward functions yielded similar, if not stronger, empirical findings (see Appendix G).
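Representing predictions and gold labels as sets of (start, end, tag) spans (our own encoding), the reward can be implemented directly. The conventions for empty prediction and empty gold sets below are our assumptions, not prescribed by the paper:

```python
def modified_precision(pred_spans, gold_spans, c=2.0):
    """Span-level precision with an explicit false-positive penalty:
    (TP - c*FP) / (TP + FP). Empty predictions score 0 (an assumption)."""
    tp = len(pred_spans & gold_spans)
    fp = len(pred_spans - gold_spans)
    if tp + fp == 0:
        return 0.0
    return (tp - c * fp) / (tp + fp)

def recall(pred_spans, gold_spans):
    """Span-level recall; vacuously 1.0 for empty gold (an assumption)."""
    if not gold_spans:
        return 1.0
    return len(pred_spans & gold_spans) / len(gold_spans)

def reward(pred_spans, gold_spans, beta=4.0, c=2.0):
    """R = beta * precision' + recall, ranging from -beta*c to beta + 1."""
    return (beta * modified_precision(pred_spans, gold_spans, c)
            + recall(pred_spans, gold_spans))
```

With (c, β) = (2, 4), the completely imprecise y_1 above scores -8 and the empty y_2 scores 0, matching the stated reward range of [-8, +5].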

4.2. CORRECTION FUNCTION

The goal of the correction function is to help explore the output space by yielding improved predictions y+ = C(y, y*) such that R(y+, y*) ≥ R(y, y*) with high probability. While the suitability of a correction function may depend on the desired reward function, in this section we define three correction operations that we found widely effective across a slew of sequential tagging reward functions: Drop, Replace, and Annotate.

Drop: Fixes precision errors by dropping tagged spans in the prediction y that do not match the ground truth y*. In addition to improving precision, this often enables exploring new candidates, since NLL training does not explicitly teach the model to drop uncertain tags. For the example y_1 in Section 4.1, this would result in all predicted spans being dropped.

Replace: Replaces incorrect tags in y with the correct ground-truth tags. For y_1 this would fix "[D X]" and "[F V]" to produce "[A X] Y [E Z W U] [C V]".

Annotate: Improves recall by annotating untagged spans with the correct ground-truth tags.

We can also combine these operations. For example, combining all three can perfectly fix any prediction. Given our example reward's focus on precision, we combine the first two (which improve precision) to give our proposed correction function (see Algorithm 1). This function not only fixes most possible errors in predictions, but also helps efficiently explore the output space.
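Over sets of (start, end, tag) spans (our own encoding), Drop and Replace, and one plausible composition of the two, can be sketched as follows. The exact Algorithm 1 may order or interleave the operations differently; this is an illustrative reconstruction:

```python
def drop(pred_spans, gold_spans):
    """Drop: remove predicted spans that do not match the gold annotation,
    fixing false positives."""
    return {s for s in pred_spans if s in gold_spans}

def replace(pred_spans, gold_spans):
    """Replace: if a predicted span has the correct boundaries but the
    wrong tag, swap in the gold tag; other spans pass through unchanged."""
    gold_by_bounds = {(s, e): tag for s, e, tag in gold_spans}
    return {(s, e, gold_by_bounds.get((s, e), tag))
            for s, e, tag in pred_spans}

def correct(pred_spans, gold_spans):
    """Combined correction (sketch of Algorithm 1): first repair wrong tags
    on correctly-bounded spans, then drop whatever is still erroneous."""
    return drop(replace(pred_spans, gold_spans), gold_spans)
```

On the y_1 example, Replace alone recovers "[A X] Y [E Z W U] [C V]"; composing with Drop then removes the mis-bounded "[E Z W U]" span, strictly improving the precision-weighted reward.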

5.1. DATASETS

For a comprehensive evaluation, we used six different datasets spanning different domains and tasks, as detailed in Table 2. ATIS (Price, 1990) and SNIPS (Coucke et al., 2018) are two slot-filling tasks covering the travel (e.g., booking a flight ticket) and virtual assistant domains, respectively. MIT-restaurant (MIT-R) [1] and MTOP (Li et al., 2021) are semantic parsing datasets from the dining (e.g., ordering a dish) and voice assistant domains, respectively. The CoNLL 2003 dataset (Tjong Kim Sang & De Meulder, 2003) is from the news domain (i.e., Reuters) for the NER task. Lastly, the CoNLL 2000 dataset (Tjong Kim Sang & Buchholz, 2000) is also from the news domain (i.e., WSJ) for the syntactic chunking task. For space reasons, we report results for the latter two datasets in Appendix F. Additionally, to verify our findings hold across languages, we also evaluated on the French and Hindi MTOP datasets, with results reported in Appendix A.

5.2. METHODS COMPARED

Our primary comparison is against the current standard: a model NLL_θ trained solely using the NLL loss. For the most fair and competitive baseline, we train the NLL model using the entire training set. This also forms the initialization point for all iterative models. Note that this means our iterative models do not benefit from any data specifically held out for iterative training. While previous work (Raman et al., 2022) has shown that a generative NLL-based approach outperforms token-classification methods, for the sake of completeness we evaluate three additional token classification baselines and report results in Appendix H. We also compare against the best existing (and most relevant) approach for reward optimization, denoted Shu++ (Shu et al., 2021). Shu++ relies solely on the decoder top-k for exploring the reward space, and defines a contrastive loss between the best- and worst-reward candidates; the comparison with Shu++ thus helps us understand the benefits of our correction function for better exploration in the reward optimization framework. In addition to GROOT, we also evaluate a variant GROOT NG (NG: No Gold) which does not use the gold label in the corrective loss, i.e., using CML_θ(x, ỹ, r̃, ỹ+, r̃+) instead of CRL (in Equation (4)).

5.3. MODEL

As an instance of a competitive sequence-to-sequence model, we use pre-trained mT5 (Xue et al., 2021) and build on the T5X code base (Roberts et al., 2022). We use a Base-sized model to maximize experimentation, but verify that the results hold for larger (XXL-sized) models in the Appendix. We use the Adafactor optimizer (Shazeer & Stern, 2018) to train all models, along with Z-loss regularization (de Brébisson & Vincent, 2016). A constant learning rate of 0.001 is used for the NLL-only training stage, and 0.0001 otherwise. NLL training is run for up to 2500 steps (evaluating checkpoints every 100 steps), while iterative reward optimization runs for up to 200 steps (evaluating every 10 steps). For both NLL training and iterative reward optimization, we select the best checkpoint θ^(t) per the reward metric of the top-1 prediction on the validation set D_val. Test set metrics are reported from the best checkpoint of the last iteration.

5.4. DEFAULT PARAMETER SETTINGS OF PROPOSED FRAMEWORK

Below are the default (and recommended) settings of our proposed framework. The effects of these different parameters are analyzed in Section 6.1 onwards and in the Appendix.

- Text format: We use the "sentinel+tag (SI)" format to represent the input and output texts, as recent studies (FitzGerald, 2020; Raman et al., 2022) have found it to be most effective and ideal for minimizing hallucinations in text generation.
- Data preparation: We set the number of reward optimization iterations to T = 9 (Section 3.2) and create L̃_t from the entire training set at each iteration.
- Beam size: We set the beam size to k = 5 for the top-k prediction stage in both training and evaluation.
- Candidate selection: To select the candidate ỹ to improve from the top-k, we default to selecting the highest-reward (non-perfect) candidate.
- Candidate correction: We use the correction function described in Section 4.2.
- Loss function: We set (α, λ) = (0.5, 0.1) for our loss function in Section 3.3.

Note:

To avoid any "human overfitting", all development of the methodology (incl. refinements and ablations) was conducted using only the validation sets. Test sets were only used to report final results for the paper (in a completely human-blind fashion).

5.5. EVALUATION METRICS

We evaluate each example by the following four metrics, and report macro-averaged scores. All reported scores are means and standard deviations across five different runs. Results for additional metrics are reported in the Appendix.

- Top-1 reward: Equation (5) evaluated on the top-1 candidate from the beam search. (Maximum value: 5.0)
- Average reward: Reward averaged across the top-k candidates. (Maximum value: 5.0)
- Max reward: The highest reward value among the top-k candidates. (Maximum value: 5.0)
- Rank correlation: Spearman's rank correlation coefficient, measuring whether the model has correctly ordered the top-k candidates by their reward values. (Maximum value: 1.0)
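Given the reward of each top-k candidate in beam order, the per-example metrics can be sketched as follows. For Spearman's ρ we use the simple tie-free formula; a production implementation would need tie handling:

```python
from statistics import mean

def reward_ranks(rewards):
    """Rank candidates by reward, best reward -> rank 1 (assumes no ties)."""
    order = sorted(range(len(rewards)), key=lambda i: -rewards[i])
    ranks = [0] * len(rewards)
    for pos, i in enumerate(order):
        ranks[i] = pos + 1
    return ranks

def topk_metrics(rewards):
    """rewards: reward value of each top-k candidate, in model (beam) order."""
    n = len(rewards)
    # Spearman's rho between beam rank (1..n) and reward rank, no-ties formula.
    d2 = sum((m - r) ** 2
             for m, r in zip(range(1, n + 1), reward_ranks(rewards)))
    rho = 1 - 6 * d2 / (n * (n * n - 1))
    return {"top1": rewards[0], "avg": mean(rewards),
            "max": max(rewards), "rank_corr": rho}
```

A perfectly ordered beam gives ρ = 1.0; a fully inverted one gives ρ = -1.0, so ρ directly measures how well the decoder distribution tracks the reward.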

6. RESULTS AND DISCUSSIONS

As seen from the test results in Tables 3 and 8, both GROOT variants significantly outperform the NLL and Shu++ baselines across (nearly) all metrics on all six datasets. GROOT not only improves the reward of the top-1 prediction consistently, but also significantly improves the quality of the top-k predictions, as evidenced by the large increases in Average Reward. GROOT methods also score highest on the Max Reward among the top-k. The improved rank correlations provide the final confirmation that the GROOT models do indeed gain a better understanding of the output space with respect to the provided (black-box) reward function. These results also demonstrate the importance of efficiently exploring the output space. As seen in Table 3, the Shu++ model is often no better, and in many cases significantly worse, than the NLL baseline. This clearly illustrates that simply relying on the top-k may not suffice. More specifically, the lower quality of the NLL baseline's top-k provides very little room for hill-climbing towards better reward scores, which in turn leads to the poor scores observed. In contrast, the GROOT methods outperform Shu++ on every metric, showing that correction-function-based exploration coupled with the contrastive loss functions leads to a significantly better model.

6.1. EFFECT OF LOSS FUNCTIONS

Figure 3 shows how the top-1 and average reward values change over the course of the iterative training process. We can clearly see that the GROOT methods smoothly improve the reward scores and the decoder output distribution. Between the GROOT variants, we find that GROOT NG tends to optimize the intended reward more aggressively (since it is not anchored to also optimize the gold label probability). This does come with a drawback: metrics other than the provided reward may degrade (as shown in Table 7 in the Appendix), due to the focus on the provided reward. Regardless, the strong performance of GROOT NG hints at the promise of such techniques even in low-supervision / unsupervised settings. However, we leave this to future work and instead focus on the vanilla GROOT model in the rest of the analyses.

6.2. EFFECT OF CORRECTION FUNCTION

Figure 4 shows the effect of the correction function used. While "Drop+replacement" corresponds to our default correction function (Algorithm 1), "Only drop" uses only the drop operation. We can see that the drop operation by itself is quite effective, due to the increased exploration it offers relative to NLL-only training. Coupled with the earlier results, we clearly see how correction functions can help break the limits of exploration in Seq2Seq models. While alternatives such as decoder distribution sampling (Ranzato et al., 2016; Ackley et al., 1985; Ficler & Goldberg, 2017) do exist, the highly skewed output distributions of the NLL model (as seen in Table 1) often constrain exploration and require significant tuning of temperature parameters to smooth out the distribution. Thus we conclude that correction functions provide a far simpler and more effective way of exploring the output space.

6.3. EFFECT OF CANDIDATE SELECTION STRATEGY

Figure 5 shows how different candidate selection strategies affect the results. While our default choice corresponding to "Best" is often the most consistent, other strategies such as "Worst" (lowest reward candidate) or "Random" are also effective at helping the model learn reward contours. Thus we believe that our framework is not sensitive to the selection strategies, and could even incorporate multiple candidates via different ranking objectives (Jagerman et al., 2022) .

6.4. EFFECT OF λ

Figure 6 shows how the NLL loss contribution (controlled by λ in Equation 4) affects results. In general, we find that a small contribution of NLL, as seen in the λ = 0.1 curves, helps stabilize learning, while larger values still benefit from iterative improvements of reward (albeit slightly less). Training with λ = 0.0, however, was fairly unstable, and thus we recommend setting λ > 0.

7. CONCLUSION

We have presented a simple yet effective framework, GROOT, and empirically verified its effectiveness on sequential labeling across multiple tasks and languages. Via correction-function-based exploration and a contrastive loss, GROOT is able to directly and effectively optimize a Seq2Seq model for a given black-box reward function. Furthermore, unlike with standard maximum likelihood training, the output distribution of GROOT also improves significantly and is well correlated with the desired reward metric.

B RESULTS WITH MT5 XXL

We mainly used mT5 BASE [2] to maximize experimentation and analyze the effectiveness of different aspects of our proposed framework. While this is already a large transformer model (Vaswani et al., 2017), a natural question is whether the observed gains carry over to even larger models. To answer this, we verify the effectiveness using mT5 XXL. [3] With 13B+ parameters, the XXL variant is significantly larger than mT5 BASE, as described in Xue et al. (2021). For example, mT5 BASE consists of 12 transformer layers with 768-dimensional embeddings and 2048-dimensional Multi-Layer Perceptrons (MLPs), while mT5 XXL consists of 24 layers with 4096-dimensional embeddings and 10240-dimensional MLPs. Table 5 shows the results. Compared with the results in Table 3, it is unsurprising to see significant gains for all methods using mT5 XXL over mT5 BASE. However, we still observe the same trends, with GROOT improving the reward-based metrics. It is interesting to note that while the average reward scores improve significantly simply by using the larger model (except on ATIS), GROOT still maintains a sizeable gap. Thus we believe our reward optimization framework will remain helpful as ever-larger models are developed.

D ADDITIONAL EVALUATION METRICS

In addition to the evaluation metrics introduced in Section 5.5, we also report results with further metrics for reference, including:

- Top-1 precision: Equation (6) evaluated on the top-1 candidate.

As seen in Figure 10, despite starting from a much worse initialization, the GROOT model quickly recovers, outperforming NLL and nearly catching up to the "all" performance on all datasets. For reference, we also show Figure 11. We believe that with a more careful data strategy and/or a larger training set, we could see further gains.

F EXPERIMENTS ON DIFFERENT TASKS

This section shows how GROOT works on the NER and chunking tasks, where all experimental settings are consistent with those on the ATIS, SNIPS, MTOP, and MIT-R datasets. Table 8 shows the results. We can see that GROOT consistently improves upon the NLL baseline in all the reward-based metrics on both tasks, while Shu++ does not. These results show the generalization ability of our proposed framework across different tasks.

G EXPERIMENTS WITH DIFFERENT REWARD FUNCTIONS

As discussed in Section 4.1, we limited our exposition in the main body of the paper to a single reward function for simplicity. However, we found similar, if not stronger, empirical results with other reward functions, as described in this section.

G.1 DEFINITIONS: VARIANTS OF REWARD FUNCTIONS

Our default reward function, defined by Equations (5) and (6) and denoted R_PR, linearly interpolates a variant of precision and recall. This reward function inherently favors a small number of highly confident spans; the results in Table 7 show such a trade-off. To explore a different set of rewards with potentially different characteristics, we took inspiration from the popular Jaccard similarity (Jaccard, 1912) and the Tversky index (Tversky, 1977).

G.1.1 MODIFIED JACCARD SIMILARITY

The Jaccard similarity can be defined as:

JAC(y, y*) = |Y ∩ Y*| / |Y ∪ Y*| = TP / (TP + FP + FN),

where Y is the set of predicted spans, Y* is the set of ground-truth spans, and FN stands for false negatives. Compared with precision in Equation (6), this already takes recall into account via the false negatives. As another example of a more complex reward function, we can extend this with a similar penalty coefficient c to obtain:

R_JAC(y, y*) = (TP - c × FP) / (TP + FP + FN).

G.1.2 MODIFIED TVERSKY INDEX

The Tversky index is a generalization of the Jaccard similarity and the Sorensen-Dice coefficient (Sorensen, 1948; Dice, 1945), which allows for fine-grained control of the balance between different error classes:

TVE(y, y*) = |Y ∩ Y*| / (|Y ∩ Y*| + γ × |Y \ Y*| + ω × |Y* \ Y|) = TP / (TP + γ × FP + ω × FN),

where γ and ω are hyperparameters. We can further extend this in an analogous manner to define the following reward function:

R_TVE(y, y*) = (TP - c × FP) / (TP + γ × FP + ω × FN).

We use the MTOP (en) and MIT-R datasets to verify that GROOT works effectively with these reward functions. We simply replace the original reward function with the new ones and do not change any other settings, including the hyperparameters. The only change we make is to re-scale the values of R_JAC and R_TVE by a factor of (β + 1.0) in the loss function, to align the value ranges of the rewards with what we had previously. We report evaluation scores for the reward function used for training in each setup; for example, we use R_JAC for evaluation if R_JAC was used for training. We report results (averaged over 5 runs) for {R_JAC, c = 1.0} and {R_TVE, c = 1.0, γ = 0.5, ω = 1.0}. We note that we observed consistent results with other setups.
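Over sets of (start, end, tag) spans (our own encoding), the two penalized reward variants can be sketched as follows; the c-penalized numerators mirror the modified precision of Section 4.1 and should be read as our reconstruction:

```python
def counts(pred, gold):
    """TP, FP, FN over (start, end, tag) span sets."""
    tp = len(pred & gold)
    return tp, len(pred) - tp, len(gold) - tp

def r_jac(pred, gold, c=1.0):
    """Modified Jaccard reward: (TP - c*FP) / (TP + FP + FN).
    Empty prediction vs. empty gold scores 1.0 (an assumption)."""
    tp, fp, fn = counts(pred, gold)
    denom = tp + fp + fn
    return (tp - c * fp) / denom if denom else 1.0

def r_tve(pred, gold, c=1.0, gamma=0.5, omega=1.0):
    """Modified Tversky reward: (TP - c*FP) / (TP + gamma*FP + omega*FN)."""
    tp, fp, fn = counts(pred, gold)
    denom = tp + gamma * fp + omega * fn
    return (tp - c * fp) / denom if denom else 1.0
```

Setting gamma below omega (as in the {c = 1.0, γ = 0.5, ω = 1.0} setup) discounts false positives in the denominator relative to false negatives, giving a recall-leaning counterpart to the precision-leaning R_PR.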

G.3 RESULTS

Tables 9 and 10 show the results. We can see that GROOT consistently improves upon the NLL baseline in all the reward-based metrics, while Shu++ does not. These results empirically verify the robustness of our proposed reward optimization framework.

H TOKEN CLASSIFICATION RESULTS

Previous work (Raman et al., 2022) has shown that a generative NLL-based approach outperforms other token-classification approaches. However, for the sake of completeness, we implemented and evaluated three additional baselines:



https://groups.csail.mit.edu/sls/downloads/
https://github.com/google-research/t5x/blob/main/t5x/examples/t5/mt5/base.gin
https://github.com/google-research/t5x/blob/main/t5x/examples/t5/mt5/xxl.gin



Figure 2: (a) An overview of the GROOT pipeline and (b) training data generation process.

(a) Model predictions for an example from the SNIPS validation set (NLL score, prediction):
0.213  book [party size number seven] in [spatial relation neighboring] [geographic poi moorpark]
0.564  book [party size number seven] in [spatial relation neighboring] [poi moorpark]
1.311  book [party size number seven] in [spatial relation neighboring] [city moorpark] (*)
1.443  book [party size number seven] in [spatial relation neighboring] [object name moorpark]
1.878  book [party size number seven] in [spatial relation neighboring moorpark]

(b) Model predictions for an example from the SNIPS training set (NLL score, prediction):
0.0003  add [artist stephen mcnally] to [playlist confidence boost] (*)
3.9628  add [entity name stephen mcnally] to [playlist confidence boost]
4.0684  add [artist stephen mcnally] to [playlist confidence] boost
4.6491  add [artist stephen mcnally] to [entity name confidence boost]
5.3953  add [artist stephen mcnally] to [album confidence boost]

- Top-1 reward: The reward for the top-1 candidate from the beam search. (Maximum value: 5.0)
- Average reward: Reward averaged across the top-k candidates. (Maximum value: 5.0)
- Max reward: The highest reward value among the top-k candidates. (Maximum value: 5.0)
- Rank correlation: Spearman's rank correlation coefficient, measuring whether the model predicts the top-k candidates in order of their reward values. (Maximum value: 1.0)
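A minimal sketch of how these four metrics could be computed from the top-k reward values, listed in beam-search order. Using the tie-free Spearman formula is an illustrative simplification, not necessarily the paper's exact implementation:

```python
def rank_metrics(rewards):
    """Metrics over top-k candidate rewards given in beam-search order.

    Returns (top1, average, max, spearman). Spearman compares the beam
    order against the descending-reward order; ties are not specially
    handled in this sketch.
    """
    k = len(rewards)
    top1 = rewards[0]
    avg = sum(rewards) / k
    mx = max(rewards)
    # Rank of each candidate by reward (0 = highest reward).
    by_reward = sorted(range(k), key=lambda i: -rewards[i])
    reward_rank = {i: r for r, i in enumerate(by_reward)}
    # Spearman's rho for distinct ranks: 1 - 6 * sum(d^2) / (k * (k^2 - 1)).
    d2 = sum((i - reward_rank[i]) ** 2 for i in range(k))
    rho = 1.0 if k < 2 else 1 - 6 * d2 / (k * (k ** 2 - 1))
    return top1, avg, mx, rho
```

A perfectly ordered beam gives rho = 1.0; a fully reversed one gives rho = -1.0.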

Figure 3: Performance of different loss functions on the top-1 and average reward metrics. For all figures (here and below) error bars denote standard deviations across 5 runs.

Figure 4: Comparison between different correction functions.

Figure 5: Comparison between different candidate selection strategies.

Figure 6: Comparison between different values of λ in Equation (4).

- … for the top-1 candidate. (Maximum value: 1.0)
- Top-1 precision: The standard precision metric for the top-1 candidate. (Maximum value: 1.0)
- Top-1 recall: The standard recall metric for the top-1 candidate. (Maximum value: 1.0)
- Top-1 EM: A binary score checking whether the top-1 candidate perfectly matches the ground-truth annotation. (Maximum value: 1.0)
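The span-level precision, recall, and EM metrics above can be sketched as follows; the set-based span representation and the zero convention for empty sets are assumptions for illustration:

```python
def top1_metrics(pred_spans, gold_spans):
    """Top-1 precision, recall, and exact match over span sets (a sketch)."""
    Y, Y_star = set(pred_spans), set(gold_spans)
    tp = len(Y & Y_star)
    precision = tp / len(Y) if Y else 0.0       # empty-set convention (assumption)
    recall = tp / len(Y_star) if Y_star else 0.0
    em = 1.0 if Y == Y_star else 0.0            # binary exact-match score
    return precision, recall, em
```

EM is the strictest of the three: a single wrong or missing span drops it to zero even when precision and recall stay high.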

Figures 7, 8, 9, and 11 show how the average reward scores improve over the course of the iterative reward optimization. The average reward improves smoothly in all cases, and we see trends similar to those observed for the top-1 reward metric.

Figure 7: Comparison between different values of λ in Equation (4) for the average reward metric.

Figure 8: Comparison between different correction functions for the average reward metric.

Figure 9: Comparison between different candidate selection strategies for the average reward metric.

Figure 10: Comparison between different data preparation strategies.

Figure 11: Comparison between different data preparation strategies for the average reward metric.

$$R_{\mathrm{JAC}}(y, y^{*}) = \frac{|Y \cap Y^{*}| - c \times |Y \setminus Y^{*}|}{|Y \cup Y^{*}|} = \frac{\mathrm{TP} - c \times \mathrm{FP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$

$$R_{\mathrm{TVE}}(y, y^{*}) = \frac{|Y \cap Y^{*}| - c \times |Y \setminus Y^{*}|}{|Y \cap Y^{*}| + \gamma \times |Y \setminus Y^{*}| + \omega \times |Y^{*} \setminus Y|} = \frac{\mathrm{TP} - c \times \mathrm{FP}}{\mathrm{TP} + \gamma \times \mathrm{FP} + \omega \times \mathrm{FN}}.$$
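A sketch of these two penalized rewards under the same set-based span representation (an assumption); note that the shared numerator TP - c * FP can go negative when false positives dominate:

```python
def r_jac(pred_spans, gold_spans, c=1.0):
    """Modified Jaccard reward: subtracts c * FP from the numerator."""
    Y, Y_star = set(pred_spans), set(gold_spans)
    tp, fp, fn = len(Y & Y_star), len(Y - Y_star), len(Y_star - Y)
    denom = tp + fp + fn  # |Y ∪ Y*|
    return (tp - c * fp) / denom if denom > 0 else 1.0  # empty-sets convention (assumption)

def r_tve(pred_spans, gold_spans, c=1.0, gamma=0.5, omega=1.0):
    """Modified Tversky reward with FP penalty c and weights gamma, omega."""
    Y, Y_star = set(pred_spans), set(gold_spans)
    tp, fp, fn = len(Y & Y_star), len(Y - Y_star), len(Y_star - Y)
    denom = tp + gamma * fp + omega * fn
    return (tp - c * fp) / denom if denom > 0 else 1.0
```

With c = 1.0, one correct and one spurious span yield a reward of exactly zero under both functions, which is what makes these rewards harsher on false positives than the plain similarity indices.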

Test set results on the four datasets. Standard deviation values are over 5 runs. Bolded values and the † symbol indicate 99% significance (via z-test) vs. NLL and Shu++ respectively.

shows results on the MTOP dataset in Hindi and French. The effectiveness of our method is consistently observed in these languages as well.

Test set results on the Hindi and French MTOP dataset.

Test set results with mT5 XXL.

shows how the prediction examples in Table 1 (a) are modified by GROOT. When the model is not confident enough about the tag of "moorpark," the top-1 prediction leaves it untagged.

Model predictions for an example from the SNIPS validation set (NLL score, prediction):
0.154  book [party size number seven] in [spatial relation neighboring] moorpark
0.712  book [party size number seven] in [spatial relation neighboring] [geographic poi moorpark]
1.303  book [party size number seven] in [spatial relation neighboring] [poi moorpark]
1.383  book [party size number seven] in [spatial relation neighboring] [city moorpark] (*)
2.004  book [party size number seven] in [spatial relation neighboring] [object name moorpark]

Top-5 predictions refined by GROOT. This can be contrasted with Table 1 (a).

shows the results, corresponding to those in Table 3. As expected, the precision scores improve significantly, while recall (and hence EM) degrades slightly.

Additional evaluation metrics for the results in Table 3.

Test set results on the NER and chunking tasks.

                  MTOP (en)                                        MIT-R
                  NLL            Shu++          GROOT             NLL            Shu++          GROOT
Top-1 reward      0.836 ± 0.012  0.845 ± 0.001  0.837 ± 0.009     0.621 ± 0.027  0.618 ± 0.014  0.651 ± 0.005
Average reward    0.133 ± 0.013  0.329 ± 0.033  0.200 ± 0.005     0.172 ± 0.019  0.100 ± 0.062  0.263 ± 0.010
Max reward        0.958 ± 0.003  0.947 ± 0.004  0.963 ± 0.002     0.884 ± 0.162  0.860 ± 0.005  0.890 ± 0.006
Rank correlation  0.701 ± 0.009  0.708 ± 0.002  0.811 ± 0.017     0.614 ± 0.006  0.646 ± 0.020  0.673 ± 0.004

• mBERT: This uses an mBERT (encoder-only) model (Devlin et al., 2018) to predict the per-token labels. We used the Base-sized mBERT model, as it is most comparable with the Base-sized mT5 backbone used in most of our experiments.
• mT5: As shown in prior work (Lewis et al., 2019; Raman et al., 2022), generative models (like T5) can also be used to predict per-token labels. Thus we used the same mT5-Base model to predict the sequence of BIO token labels. As recommended by Raman et al. (2022), we use a single Inside label, rather than a per-class label, to improve performance.
• mT5 (+Input): The results from Raman et al. (2022) showed that learning to generate the input tokens along with the BIO token labels leads to better performance with mT5 models. Thus we evaluated this as an additional baseline.

The results for these additional baselines are provided in Table 11. When compared with the other methods (Table 3 for R_PR, Table 9 for R_JAC, and Table 10 for R_TVE), we find that GROOT significantly outperforms all three token classification approaches, a further validation of our approach. Note that GROOT may potentially be helpful in the token classification setting as well; however, we leave this for future work to explore, given the general potency of the generative approach.
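The single-Inside-label scheme used in the mT5 baseline could be sketched as follows; the token-range span representation and function name are simplifying assumptions, not the paper's actual preprocessing:

```python
def to_bio_single_inside(tokens, spans):
    """Convert labeled spans to BIO tags with a single shared 'I' label.

    `spans` maps (start, end) token ranges (end exclusive) to class names;
    this representation is an assumption for illustration. Only the 'B-'
    tag carries the class; all inside tokens share one generic 'I' label,
    as recommended by Raman et al. (2022).
    """
    tags = ["O"] * len(tokens)
    for (start, end), label in spans.items():
        tags[start] = f"B-{label}"       # class name only on the begin tag
        for i in range(start + 1, end):
            tags[i] = "I"                # single shared inside label
    return tags
```

For example, "black & decker blender" with a BRAND span over the first three tokens and a PRODUCT span over the last yields ["B-BRAND", "I", "I", "B-PRODUCT"], shrinking the output vocabulary relative to per-class inside labels.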

