GROOT: CORRECTIVE REWARD OPTIMIZATION FOR GENERATIVE SEQUENTIAL LABELING

Abstract

Sequential labeling is a fundamental NLP task, forming the backbone of many applications. Supervised learning of Seq2Seq models has shown great success on these problems. However, the training objective remains significantly disconnected from the metrics and desiderata we care about in practice. For example, a practical sequence tagging application may want to optimize for a certain precision-recall trade-off (of the top-k predictions), which is quite different from the standard objective of maximizing the likelihood of the gold labeled sequence. To bridge this gap, we propose GROOT, a simple yet effective framework for Generative Reward Optimization Of Text sequences. GROOT works by training a generative sequential labeling model to match the decoder output distribution with that of the (black-box) reward function. Using an iterative training regime, we first generate prediction candidates, then correct errors in them, and finally contrast those candidates based on their reward values. As demonstrated via extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics. Furthermore, GROOT improves the overall decoder distribution, as evidenced by the quality gains of the top-k candidates.

1. INTRODUCTION

Figure 1: Results for our model (GROOT) vs. the NLL baseline, demonstrating the precipitous drop-off in the quality of NLL model predictions outside the top-1.

Sequential labeling tasks are ubiquitous in NLP applications. Tasks ranging from syntactic analysis (e.g., POS tagging and phrase chunking) to semantic analysis (e.g., named entity recognition, slot filling, and query segmentation) are critical components of end-to-end applications such as search engines and goal-oriented dialog systems. Advances in the pretraining of generative language models (LMs) like T5 (Raffel et al., 2020) and mT5 (Xue et al., 2021) have enabled us to use the same training strategy seamlessly across these diverse sequence labeling tasks: we can fine-tune a pretrained LM by maximizing the likelihood of generating the ground-truth (human-annotated) labeled data. However, in practice, the metrics and constraints we care about remain fairly disconnected from the standard Negative Log-Likelihood (NLL) objective used to train these models. To understand this better, consider an entity recognition model within an e-commerce system. This model would typically be trained on data of the following form:

Input: black & decker blender under 100
Label: [BRAND black & decker] [PRODUCT blender] [PRICE under 100]

Table 1: Top-5 predictions of the NLL baseline model (with errors bolded, and the perfect prediction marked with (*)).

Input: X Y Z W U V
Ground-truth annotation: [A X] Y Z [B W U] [C V]
Top-k predictions:
…
[D X] [E Y] [F Z W U] [G V] ← reward value: -8.00
[A X] Y [E Z W U] [F V] ← reward value: -3.66
[A X] Y Z W U [C V] ←
…
[A X] Y [E Z W U] [C V] ← reward value: 0.667

While this e-commerce pipeline could utilize the model's predictions in different ways, a likely use is retrieving candidates that match the predicted annotations. However, models are imperfect; even well-trained models may make errors like:

Incorrect prediction: [COLOR black] & decker [PRODUCT blender] [PRICE under 100]
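To make the precision-recall concern concrete, the sketch below parses bracketed annotations into spans and scores a prediction with a span-level F-beta, where beta < 1 favors precision. The regex (which assumes single-token slot labels, as in the e-commerce example) and the F-beta weighting are our illustrative assumptions, not the paper's actual reward function:

```python
import re

def parse_spans(labeled):
    """Extract (label, text) pairs from annotations like
    '[BRAND black & decker] [PRODUCT blender] [PRICE under 100]'.
    Assumes single-token labels (an assumption for illustration)."""
    return {(m.group(1), m.group(2).strip())
            for m in re.finditer(r"\[(\S+)\s+([^\]]+)\]", labeled)}

def span_f_beta(pred, gold, beta=0.5):
    """Span-level F-beta; beta < 1 weights precision over recall,
    penalizing spurious spans (precision errors) more heavily."""
    p, g = parse_spans(pred), parse_spans(gold)
    tp = len(p & g)                      # spans matching label and text
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    b2 = beta * beta
    return (1 + b2) * prec * rec / (b2 * prec + rec)

gold = "[BRAND black & decker] [PRODUCT blender] [PRICE under 100]"
bad = "[COLOR black] & decker [PRODUCT blender] [PRICE under 100]"
```

Under this toy reward, the incorrect [COLOR black] span costs both precision and recall, so `span_f_beta(bad, gold)` is well below the perfect score of 1.0.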

Thus such a retrieval use case may require a desired precision-recall balance, since precision errors (i.e., incorrectly annotated spans) could lead to catastrophic failures downstream; in the above example, perhaps incorrectly filtering to only "black"-colored blenders. Unfortunately, current models do not allow us to optimize for or incorporate such complex metrics or trade-offs. Such errors are in fact commonplace, as seen in Table 1 and in our empirical results. The issue is further exacerbated when we go beyond the top-1 prediction, as seen in Figure 1, with a drastic drop-off in the quality of predictions outside the top-1. In addition to identifying this shortcoming of NLL-based models, we make the following contributions in this paper:

• We propose GROOT, a simple yet effective framework for training sequence labeling models to optimize (black-box) reward metrics. GROOT trains a generative sequential labeling model to learn the reward metrics associated with the output space of sequences.
• We propose CML, a new Corrective Margin Loss function, which contrasts candidates with differing reward values to enable the model to better understand the reward space.
• We show that simply relying on candidates sampled from the decoder output distribution, as proposed in prior work for machine translation (Shu et al., 2021), does not work (and often worsens reward scores significantly).
• To enable principled and targeted exploration of the output reward space, we introduce a correction-function-based approach that corrects errors in predictions to explore the space.
• Extensive experiments over four public datasets demonstrate that GROOT significantly improves rewards over competitive baselines.
• Furthermore, we demonstrate that GROOT learns a better overall decoder distribution, with significant gains in correlation with reward scores.
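The intuition behind a corrective margin loss can be sketched as a pairwise hinge over candidate pairs: the higher-reward candidate's log-probability should exceed the lower-reward candidate's by a margin proportional to the reward gap. The form below is purely illustrative (it is not the paper's actual CML definition); `margin_scale` and the hinge shape are our assumptions:

```python
def corrective_margin_loss(logp_hi, reward_hi, logp_lo, reward_lo,
                           margin_scale=1.0):
    """Illustrative contrastive hinge: require the higher-reward
    candidate's log-probability to beat the lower-reward candidate's
    by a margin proportional to their reward gap."""
    assert reward_hi >= reward_lo, "pass the higher-reward candidate first"
    margin = margin_scale * (reward_hi - reward_lo)
    # zero loss once the model's ranking respects the reward ranking
    # by at least the required margin
    return max(0.0, margin - (logp_hi - logp_lo))
```

When the model already scores the higher-reward candidate well above the lower-reward one, the loss vanishes; when the ranking is inverted, the loss grows with both the probability gap and the reward gap.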
With extensive experiments aimed at understanding the value of each component of GROOT, we believe our work can significantly influence practical applications of sequence labeling models.



Figure 2: (a) An overview of the GROOT pipeline and (b) the training data generation process.

(a) NLL model predictions for an example from the SNIPS validation set (NLL score, prediction; perfect prediction marked with (*)):
0.213 book [party size number seven] in [spatial relation neighboring] [geographic poi moorpark]
0.564 book [party size number seven] in [spatial relation neighboring] [poi moorpark]
1.311 book [party size number seven] in [spatial relation neighboring] [city moorpark] (*)
1.443 book [party size number seven] in [spatial relation neighboring] [object name moorpark]
1.878 book [party size number seven] in [spatial relation neighboring moorpark]

(b) NLL model predictions for an example from the SNIPS training set:
0.0003 add [artist stephen mcnally] to [playlist confidence boost] (*)
3.9628 add [entity name stephen mcnally] to [playlist confidence boost]
4.0684 add [artist stephen mcnally] to [playlist confidence] boost
4.6491 add [artist stephen mcnally] to [entity name confidence boost]
5.3953 add [artist stephen mcnally] to [album confidence boost]
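The validation-set example above illustrates the gap GROOT targets: the prediction matching the gold labels has only the third-lowest NLL. A minimal sketch of reranking the decoder's top-k by an external reward (here a toy exact-match indicator stands in for the real black-box metric; the function name is ours):

```python
# Top-k (NLL, prediction) candidates from the SNIPS validation example.
candidates = [
    (0.213, "book [party size number seven] in [spatial relation neighboring] [geographic poi moorpark]"),
    (0.564, "book [party size number seven] in [spatial relation neighboring] [poi moorpark]"),
    (1.311, "book [party size number seven] in [spatial relation neighboring] [city moorpark]"),
    (1.443, "book [party size number seven] in [spatial relation neighboring] [object name moorpark]"),
    (1.878, "book [party size number seven] in [spatial relation neighboring moorpark]"),
]
gold = "book [party size number seven] in [spatial relation neighboring] [city moorpark]"

def rerank_by_reward(cands, reward_fn):
    """Reorder decoder candidates by an external (black-box) reward,
    highest reward first; sorted() is stable, so ties keep NLL order."""
    return sorted(cands, key=lambda c: -reward_fn(c[1]))

# Toy exact-match reward: 1.0 for the gold-matching prediction, else 0.0.
reranked = rerank_by_reward(candidates, lambda p: 1.0 if p == gold else 0.0)
```

After reranking, the gold-matching candidate moves from rank 3 (by NLL) to rank 1, which is exactly the kind of reordering a reward-aware training objective should make unnecessary.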

