ON COMPOSITIONAL UNCERTAINTY QUANTIFICATION FOR SEQ2SEQ GRAPH PARSING

Abstract

Recent years have witnessed the success of applying seq2seq models to graph parsing tasks, where the outputs are compositionally structured (e.g., a graph or a tree). However, these seq2seq approaches make it challenging to quantify the model's compositional uncertainty on graph structures, due to the gap between seq2seq output probabilities and structural probabilities on the graph. This work is the first to quantify and evaluate compositional uncertainty for seq2seq graph parsing tasks. First, we propose a generic, probabilistically interpretable framework that establishes a correspondence between seq2seq output probabilities and structural probabilities on the graph. This framework serves as a powerful medium for quantifying a seq2seq model's compositional uncertainty on graph elements (i.e., nodes or edges). Second, to evaluate uncertainty quality in terms of calibration, we propose a novel metric called Compositional Expected Calibration Error (CECE), which measures a model's calibration behavior in predicting graph structures. Through a thorough evaluation of compositional uncertainty on three different tasks across ten domains, we demonstrate that CECE reflects distributional shift better than vanilla sequence-level ECE. Finally, we validate the effectiveness of compositional uncertainty on the task of collaborative semantic parsing, where the model is allowed to send a limited number of subgraphs for human review. The results show that collaborative performance based on uncertain subgraph selection consistently outperforms random subgraph selection (30% average error reduction rate) and performs comparably to oracle subgraph selection (a difference of only 0.33 in average prediction error), indicating that compositional uncertainty is an ideal signal for model errors and can benefit various downstream tasks.

1. INTRODUCTION

Parsing a natural language sentence into a compositional graph structure, i.e., graph parsing, is an important natural language understanding task that goes beyond simple classification or text generation. It has been broadly applied in semantic parsing, code generation, and knowledge graph generation. Recently, a line of research has successfully applied sequence-to-sequence (seq2seq) approaches to these graph parsing tasks (Vinyals et al., 2015; Xu et al., 2020; Orhan, 2021; Cui et al., 2022; Lin et al., 2022b). Despite achieving impressive results, these approaches make it difficult to quantify the model's predictive uncertainty on graph structures, which in turn makes it hard to ensure trustworthy and reliable deployment of NLP systems such as voice assistants (see an example in Figure 1). Meanwhile, most existing work on uncertainty estimation for seq2seq models has focused on classification or language generation tasks (Kumar & Sarawagi, 2019; Vasudevan et al., 2019; Malinin & Gales, 2020; Jiang et al., 2021; Shelmanov et al., 2021; Wang et al., 2022; Pei et al., 2022). However, how to quantify and evaluate compositional uncertainty, the predictive uncertainty over compositional graph elements (i.e., nodes or edges), remains unresolved for seq2seq graph parsing (see related work in Appendix A.2). In this paper, we aim to answer these questions by proposing a simple probabilistic framework and rigorous evaluation metrics.

Quantifying compositional uncertainty for seq2seq graph parsing is inherently more difficult than for other seq2seq tasks such as machine translation or speech recognition, since there is a gap between seq2seq output probabilities and conditional probabilities on the graph. Specifically, we are interested in the conditional probability of a graph node v given its parent pa(v), i.e., p(v | pa(v), x), rather than the likelihood of v conditioned on the previous tokens in the linearized string.
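Concretely, once each graph node is aligned to its token span in the linearization, the seq2seq probability of that span can serve as a proxy for the node's structural probability. Below is a minimal Python sketch of this idea; the token names, spans, and log-probabilities are purely illustrative, not the paper's actual data or computation:

```python
import math

# Toy token-level log-probabilities from a seq2seq decoder over a
# linearized graph "( EventSpec ( TimeSpec ... ) )".  The "_2" suffixes
# only disambiguate repeated bracket tokens in this toy example.
token_logprobs = {
    "(": -0.01, "EventSpec": -0.05, "(_2": -0.02,
    "TimeSpec": -1.20, "...": -0.80, ")_2": -0.03, ")": -0.01,
}

def span_logprob(tokens, logprobs):
    """Aggregate the seq2seq log-probability of the token span that
    realizes one subgraph in the linearization."""
    return sum(logprobs[t] for t in tokens)

# Probability mass the model assigns to the TimeSpec subgraph, used as
# a proxy for p(v | pa(v), x) once the span-to-node alignment is known.
lp = span_logprob(["TimeSpec", "...", ")_2"], token_logprobs)
p_subgraph = math.exp(lp)
```

The key point is that the conditioning context for this span in the decoder is the full token prefix, not the graph parent; a GAP-style framework is what licenses reading the aggregated span probability as a probability on the graph element.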
For example, in the graph structure of Figure 1, the subgraph rooted at node TimeSpec (in the dotted square) depends on its parent node EventSpec, while in the linearized graph, the parent node is not necessarily the token immediately preceding the subgraph (the shaded spans). Consequently, we cannot directly quantify compositional uncertainty without bridging the gap between these two probabilistic representations.

To address this challenge, we propose a generic probabilistic framework called the Graph Autoregressive Process (GAP) (Section 2.1) that establishes a correspondence between seq2seq output probabilities and graphical probabilities, i.e., it assigns a model probability to each node or edge of the graph. Thus, GAP can be used as a powerful medium for quantifying a seq2seq model's compositional uncertainty. Furthermore, to evaluate uncertainty quality, we propose a novel metric called Compositional Expected Calibration Error (CECE) that measures the model's calibration behavior in predicting compositional graph structures (Section 2.2).

Taking semantic parsing as a canonical application, in Section 3 we build a large benchmark consisting of 3 semantic parsing tasks across 10 different domains, based on which we comprehensively evaluate compositional uncertainty under distributional shift and validate its effectiveness on a practical downstream task (collaborative semantic parsing). First, in Section 3.1, we report different calibration metrics for a state-of-the-art seq2seq parser (Lin et al., 2022b) based on T5 (Raffel et al., 2020), as well as its advanced uncertainty variants, on the benchmark. We demonstrate that, compared to vanilla ECE based on sequence accuracy, CECE better reflects distributional shift, i.e., task difficulty and domain generalization.
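A CECE-style measurement can be sketched with the standard binned ECE estimator, applied to per-element (node or edge) confidences and correctness labels rather than to whole-sequence exact match. This is an illustrative sketch of the idea, not the paper's exact definition:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: the bin-weighted average gap between mean
    confidence and empirical accuracy.  Fed per-sequence exact-match
    labels it gives vanilla sequence ECE; fed per-node/per-edge
    confidences and correctness it gives a compositional (CECE-style)
    calibration measure."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the bin's share of elements
    return ece

# Toy per-edge confidences and correctness: each prediction is 0.05
# away from perfect calibration, so the ECE comes out to 0.05.
ece = expected_calibration_error([0.95, 0.95, 0.05, 0.05], [1, 1, 0, 0])
```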
We also notice that, despite the strong performance these advanced uncertainty baselines bring to classification tasks, in the setting of graph parsing their absolute advantage no longer holds when predicting graph edges. This suggests that developing methods focused on compositional uncertainty can be a fruitful avenue for future research.

Second, in Section 3.2, we validate the practical effectiveness of compositional uncertainty on the problem of collaborative semantic parsing. In this setting, the model is allowed to send a limited number of uncertain subgraphs for human review (see Figure 1 for an example). We test collaborative performance on the benchmark and find that uncertain subgraph selection consistently outperforms random subgraph selection (selecting random subgraphs on the predicted graph) with an average error reduction rate of 30%, and performs close to oracle subgraph selection (selecting the incorrect subgraphs on the predicted graph) with a difference in prediction error of only 0.33. This indicates that compositional uncertainty is a reliable signal of likely model errors over graph elements, and can benefit various downstream tasks, e.g., human-AI collaborative parsing and neural-symbolic parsing (Lin et al., 2022a).

In summary, our work makes the following contributions:

• New Framework for Compositional Uncertainty Quantification. We are the first to propose a simple and general probabilistic framework (GAP) that can quantify compositional uncertainty for seq2seq graph parsing. GAP allows us to go beyond the conventional autoregressive sequence



Figure 1: An example for graph semantic parsing in a dialogue system. The output of the model is a linearized semantic graph (left) that corresponds to the graph structure (right). The dotted square indicates the uncertain part, based on which the model can ask a clarification question.
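The selection step in the collaborative setting above can be sketched as ranking subgraphs by model confidence and surfacing the least confident ones within a fixed review budget. The interface and subgraph names below are hypothetical, chosen to mirror the Figure 1 example:

```python
def select_subgraphs_for_review(subgraph_confidences, budget):
    """Pick the `budget` least-confident subgraphs of a predicted parse
    for human review.  Assumes a {subgraph_id: confidence} mapping, e.g.
    confidences produced by a GAP-style aggregation over graph elements."""
    ranked = sorted(subgraph_confidences, key=subgraph_confidences.get)
    return ranked[:budget]

# Hypothetical per-subgraph confidences for a parse like Figure 1's:
# the low-confidence TimeSpec subgraph is the one sent for review,
# e.g. as a clarification question to the user.
conf = {"EventSpec": 0.97, "TimeSpec": 0.41, "PersonSpec": 0.88}
to_review = select_subgraphs_for_review(conf, budget=1)
```

Under this selection rule, uncertainty acts exactly as the paper argues: a per-element error signal that decides where limited human attention is spent.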

