ON COMPOSITIONAL UNCERTAINTY QUANTIFICATION FOR SEQ2SEQ GRAPH PARSING

Abstract

Recent years have witnessed the success of applying seq2seq models to graph parsing tasks, where the outputs are compositionally structured (e.g., a graph or a tree). However, these seq2seq approaches pose a challenge in quantifying the model's compositional uncertainty on graph structures, due to the gap between seq2seq output probabilities and structural probabilities on the graph. This work is the first to quantify and evaluate compositional uncertainty for seq2seq graph parsing tasks. First, we propose a generic, probabilistically interpretable framework that establishes correspondences between seq2seq output probabilities and structural probabilities on the graph. This framework serves as a powerful medium for quantifying a seq2seq model's compositional uncertainty on graph elements (i.e., nodes or edges). Second, to evaluate uncertainty quality in terms of calibration, we propose a novel metric called Compositional Expected Calibration Error (CECE), which measures a model's calibration behavior in predicting graph structures. Through a thorough evaluation of compositional uncertainty on three different tasks across ten domains, we demonstrate that CECE reflects distributional shift better than vanilla sequence-level ECE. Finally, we validate the effectiveness of compositional uncertainty on the task of collaborative semantic parsing, where the model is allowed to send limited subgraphs for human review. The results show that collaborative performance based on uncertain subgraph selection consistently outperforms random subgraph selection (30% average error reduction rate) and performs comparably to oracle subgraph selection (only 0.33 difference in average prediction error), indicating that compositional uncertainty is an ideal signal for model errors and can benefit various downstream tasks.

* Work done while interning at Google Research.

1. INTRODUCTION

Parsing a natural language sentence into a compositional graph structure, i.e., graph parsing, is an important natural language understanding task that goes beyond simple classification or text generation. It has been broadly applied in applications such as semantic parsing, code generation, and knowledge graph generation. Recently, a line of research has successfully applied sequence-to-sequence (seq2seq) approaches to these graph parsing tasks (Vinyals et al., 2015; Xu et al., 2020; Orhan, 2021; Cui et al., 2022; Lin et al., 2022b). Despite achieving impressive results, these approaches pose a challenge in quantifying the model's predictive uncertainty on graph structures, making it hard to ensure a trustworthy and reliable deployment of NLP systems such as voice assistants (see an example in Figure 1). Meanwhile, most existing work on uncertainty estimation for seq2seq models has focused on classification or language generation tasks (Kumar & Sarawagi, 2019; Vasudevan et al., 2019; Malinin & Gales, 2020; Jiang et al., 2021; Shelmanov et al., 2021; Wang et al., 2022; Pei et al., 2022). How to quantify and evaluate compositional uncertainty, the predictive uncertainty over compositional graph elements (i.e., nodes or edges), remains unresolved for seq2seq graph parsing (see related work in Appendix A.2). In this paper, we aim to answer these questions by proposing a simple probabilistic framework and rigorous evaluation metrics. Quantifying compositional uncertainty for seq2seq graph parsing is inherently more difficult than for other seq2seq tasks such as machine translation or speech recognition, since there is a gap between seq2seq output probabilities and conditional probabilities on the graph. Specifically, our interest is to express the conditional probability of a graph node v with respect to its parent pa(v), i.e., p(v | pa(v), x), rather than the likelihood of v conditioned on the previous tokens in the linearized string.
For example, in the graph structure of Figure 1, the subgraph rooted at node TimeSpec (in the dotted square) depends on its parent node EventSpec, while in the linearized graph, the parent node is not necessarily the token preceding the subgraph (the shaded spans). Consequently, we cannot directly quantify compositional uncertainty without bridging this gap between the different probabilistic representations. To address this challenge, we propose a generic, probabilistic framework called the Graph Autoregressive Process (GAP) (Section 2.1) that establishes a correspondence between seq2seq output probabilities and graphical probabilities, i.e., it assigns model probability to each node or edge on the graph. Thus, GAP can be used as a powerful medium for quantifying a seq2seq model's compositional uncertainty. Furthermore, to evaluate uncertainty quality, we propose a novel metric called Compositional Expected Calibration Error (CECE) that measures the model's calibration behavior in predicting compositional graph structures (Section 2.2). Taking semantic parsing as a canonical application, in Section 3 we build a large benchmark consisting of 3 semantic parsing tasks across 10 different domains, based on which we comprehensively evaluate compositional uncertainty under distributional shift and validate its effectiveness on a practical downstream task (collaborative semantic parsing). First, in Section 3.1, we report different calibration metrics for a state-of-the-art seq2seq parser (Lin et al., 2022b) based on T5 (Raffel et al., 2020), as well as for its advanced uncertainty variants, on the benchmark. We demonstrate that, compared to vanilla ECE based on sequence accuracy, CECE is a better metric for reflecting distributional shift, i.e., task difficulty and domain generalization.
We also notice that, despite the strong performance those advanced uncertainty baselines bring on classification tasks, in the graph parsing setting their absolute advantage no longer holds when predicting graph edges. This suggests that developing uncertainty methods focused on compositional uncertainty can be a fruitful avenue for future research. Second, in Section 3.2, we validate the practical effectiveness of compositional uncertainty on the problem of collaborative semantic parsing. In this setting, the model is allowed to send a limited number of uncertain subgraphs for human review (see Figure 1 for an example). We test collaborative performance on the benchmark and find that uncertain subgraph selection consistently outperforms random subgraph selection (selecting random subgraphs from the predicted graph) with an average error reduction rate of 30%, and performs fairly close to oracle subgraph selection (selecting incorrect subgraphs from the predicted graph) with a small difference in prediction error of only 0.33. This indicates that compositional uncertainty is an ideal signal of model error over graph elements, and can benefit various downstream tasks, e.g., human-AI collaborative parsing and neural-symbolic parsing (Lin et al., 2022a). In summary, our work makes the following contributions: • New Framework for Compositional Uncertainty Quantification. We are the first to propose a simple and general probabilistic framework (GAP) that can quantify compositional uncertainty for seq2seq graph parsing. GAP allows us to go beyond the conventional autoregressive sequence probability and express parent-child conditional probabilities on the graph, and it is compatible with any graph parsing problem and any autoregressive model. • Rigorous Metrics for Compositional Uncertainty Calibration. We introduce a novel quality measurement for compositional uncertainty (CECE) that evaluates the model's calibration on graph elements and provides a better interpretation of the model's behavior in predicting graph structures under distributional shift. • Practical Effectiveness. We comprehensively evaluate the calibration behavior of modern pretrained large language models, i.e., T5 (Raffel et al., 2020), in compositional uncertainty quantification on a broad range of semantic parsing tasks of varying complexity (Redwoods, SNIPS, and SMCalFlow). We further evaluate the benefit of compositional uncertainty quantification in enabling new capabilities (e.g., fine-grained collaborative prediction for complex semantic parsing) in downstream applications. Specifically, our results show that compositional uncertainty can significantly improve collaborative parsing performance, with only a 0.33 difference in prediction error compared to the headroom.

2. QUANTIFYING AND EVALUATING COMPOSITIONAL UNCERTAINTY

Problem Formulation. In this work, we interpret the term graph parsing as mapping from surface strings (usually natural language sentences) to target representations that are explicitly or implicitly graph-structured. Formally, the input is a natural language utterance x, and the output is a DAG G = ⟨N, E⟩, where N is the set of nodes and E ⊆ N × N is the set of edges. In the case of seq2seq parsing, G is represented as a linearized graph string g = s_1 s_2 ⋯ s_L consisting of symbols {s_l}, l = 1, ..., L. In our experiments, we use PENMAN notation (Kasper, 1989) to linearize all formalisms; PENMAN is a serialization format for the directed, rooted graphs used to encode semantic dependencies (more details are available in Appendix B). Our goal is then to quantify the graph-level uncertainty p(G|x) from the sequence-level probability p(g|x) generated by a seq2seq model, which we term the compositional uncertainty. For example, our interest is to express the conditional probability of a graph node v with respect to its parent pa(v), i.e., p(v | pa(v), x), rather than the likelihood of v conditioned on the previous tokens in the linearized string. In the following sections, we introduce how to quantify (Section 2.1) and evaluate (Section 2.2) this compositional uncertainty.

2.1. QUANTIFYING COMPOSITIONAL UNCERTAINTY VIA GRAPH AUTOREGRESSIVE PROCESS (GAP)

To properly model the uncertainty p(G|x) from a seq2seq model, we need an intermediate probabilistic representation that translates the raw token-level probabilities into a distribution over graph elements (i.e., nodes and edges). To this end, we introduce a simple probabilistic formalism termed the Graph Autoregressive Process (GAP), a probability distribution that assigns seq2seq-learned probability to the graph elements v ∈ G. Specifically, since the seq2seq-predicted graph admits both a sequence-based representation g = s_1, ..., s_L and a graph representation G = ⟨N, E⟩, the GAP model adopts both an autoregressive representation p(g|x) = \prod_i p(s_i | s_{<i}, x), analogous to that of the seq2seq model (Section 2.1.1), and a probabilistic graphical model representation p(G|x) = \prod_{v \in G} p(v | pa(v), x) for proper quantification of model uncertainty on the graph (Section 2.1.2). Both representations share the same set of underlying probability measures (i.e., the graphical-model likelihood p(G|x) can be derived from the autoregressive probabilities p(s_i | s_{<i}, x)). As we will show, GAP serves as a powerful medium for quantifying compositional uncertainty for seq2seq graph parsing.
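To make the two representations concrete, the following toy sketch shows how the same underlying measure yields both a sequence likelihood and a graph likelihood; all probability values are hypothetical illustration numbers, not model outputs:

```python
import math

# Toy numeric sketch of GAP's two views of one distribution.
# All probabilities below are hypothetical illustration values.

# Autoregressive view over the linearized graph g = s1 s2 s3:
# p(g|x) = prod_i p(s_i | s_<i, x)
autoregressive = [0.9, 0.8, 0.95]          # p(s_i | s_<i, x)
p_sequence = math.prod(autoregressive)      # p(g|x) = 0.684

# Graphical view over the same graph's elements:
# p(G|x) = prod_v p(v | pa(v), x), derived from the same measure
# (here p(child | root, x) = 0.8 * 0.95 = 0.76 collapses s2 and s3).
graphical = {"root": 0.9, "child": 0.76}    # p(v | pa(v), x)
p_graph = math.prod(graphical.values())     # also 0.684

print(p_sequence, p_graph)
```

The point of the sketch is that the graphical factorization is a regrouping of the same probability mass, not a separate model.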

2.1.1. AUTOREGRESSIVE REPRESENTATION FOR LINEARIZED SEQUENCE g

Given an input sequence x and output sequence y = y_1 y_2 ⋯ y_N, the token-level autoregressive distribution from a seq2seq model is p(y|x) = \prod_{i=1}^{N} p(y_i | y_{<i}, x). In the context of graph parsing, the output sequence describes a linearized graph g = s_1 s_2 ⋯ s_L, where each symbol s_i represents either a node n ∈ N or an edge e ∈ E of the graph and corresponds to a collection of beam-decoded tokens {y_{i_1} y_{i_2} ⋯ y_{i_{N_i}}}. From this representation, we can compute the marginal probability p(s_i|x) of a symbol, and the conditional probability p(s_j | s_i, x) between two symbols (s_i, s_j) with i < j, by integrating over the space of subsequences s_{i→j} between (s_i, s_j) and the subsequence s_{<i} before s_i. Higher-order conditionals (e.g., p(s_j | (s_i, s_l), x)) can be computed analogously. Notice this gives us the ability to reason about long-range dependencies between non-adjacent symbols in the sequence. Furthermore, the conditional probability in the reverse direction can be computed using Bayes' rule: p(s_i | s_j, x) = p(s_j | s_i, x) p(s_i|x) / p(s_j|x).

Efficient Estimation Using Beam Outputs. In practice, we can estimate p(s_i|x) and p(s_j | s_i, x) efficiently via importance sampling using the outputs from beam decoding {g_k}_{k=1}^{K}, where K is the beam size (Malinin & Gales, 2020). The marginal probability can be computed as

p(s_i|x) = \sum_{k=1}^{K} p(s_i | s_{k,<i}, x) \cdot \pi_k,   \pi_k = \frac{\exp(\frac{1}{t} \log p(g_k|x))}{\sum_{k'=1}^{K} \exp(\frac{1}{t} \log p(g_{k'}|x))}   (4)

where \pi_k is the importance weight proportional to beam candidate g_k's log likelihood, and t > 0 is a temperature parameter fixed to a small constant (e.g., t = 0.1) (Malinin & Gales, 2020). If the symbol s_i does not appear in the k-th beam, we set p(s_i | s_{k,<i}, x) = 0. As shown, the marginalized probability p(s_i|x) provides a way to reason about the global importance of s_i by integrating the probabilistic evidence p(s_i | s_{k,<i}, x) over the whole beam-sampled posterior space.
It is able to capture cases of spurious graph elements s_i with high local probability p(s_i | s_{k,<i}, x) but low global likelihood (i.e., elements that only appear in a few low-probability beam candidates). Therefore, it is a useful quantity for structure induction (e.g., edge and node pruning) in graphical model inference (Dianati, 2016). Then, for two symbols (s_i, s_j) with i < j, we can estimate the conditional probability as

p(s_j | s_i, x) = \sum_{k=1}^{K} p(s_j | s_i, s_{k,i→j}, s_{k,<i}, x) \cdot \pi_k^i,   \pi_k^i = \frac{\exp(\frac{1}{t} \log p(g_k|x)) \cdot I(s_i \in g_k)}{\sum_{k'=1}^{K} \exp(\frac{1}{t} \log p(g_{k'}|x)) \cdot I(s_i \in g_{k'})}   (5)

where \pi_k^i is the importance weight over the beam candidates that contain s_i. Notice this differs from Equation 4, where \pi_k is computed over all beam candidates regardless of whether they contain s_i.
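As a sketch of this estimation procedure (Equations 4 and 5), the code below computes the temperature-scaled importance weights and the beam-marginalized probabilities. The `local_prob` callback is a hypothetical stand-in for the seq2seq model's symbol-level probability p(s_i | s_{k,<i}, x), returning 0 when the symbol is absent from a beam:

```python
import math

def beam_weights(beam_logps, t=0.1):
    """Temperature-scaled importance weights pi_k over beam candidates
    (Eq. 4), where beam_logps[k] = log p(g_k | x)."""
    scaled = [lp / t for lp in beam_logps]
    m = max(scaled)                            # stabilized softmax
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def marginal_prob(symbol, beams, beam_logps, local_prob, t=0.1):
    """Estimate p(s_i|x) = sum_k p(s_i | s_{k,<i}, x) * pi_k (Eq. 4).
    beams[k] is the k-th beam's symbol sequence; local_prob(symbol, beam)
    is a hypothetical helper returning p(s_i | s_{k,<i}, x)."""
    pis = beam_weights(beam_logps, t)
    return sum(local_prob(symbol, b) * pi for b, pi in zip(beams, pis))

def conditional_prob(sj, si, beams, beam_logps, local_prob, t=0.1):
    """Estimate p(s_j | s_i, x) (Eq. 5): renormalize the weights over
    only those beam candidates that contain s_i."""
    pis = beam_weights(beam_logps, t)
    mask = [1.0 if si in b else 0.0 for b in beams]
    z = sum(p * m for p, m in zip(pis, mask))
    if z == 0.0:
        return 0.0
    return sum(local_prob(sj, b) * pi * m
               for b, pi, m in zip(beams, pis, mask)) / z
```

With a small temperature such as t = 0.1, the weights concentrate sharply on the highest-likelihood beam, which matches the intent of the temperature scaling above.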

2.1.2. PROBABILISTIC GRAPHICAL MODEL REPRESENTATION FOR G

So far, we have focused on probability computation based on the graph's linearized representation p(g|x) = \prod_i p(s_i | s_{<i}, x). In this section, we further consider GAP's graphical model representation p(G|x) = \prod_{v \in G} p(v | pa(v), x). Specifically, given a beam sample of linearized graphs {g_k}_{k=1}^{K}, well-established algorithms exist to synthesize the different graph predictions into a meta graph G. Briefly, these methods first convert each string g_k into its original graph representation G_k = ⟨N_k, E_k⟩, then merge the multiple graphs {G_k}_{k=1}^{K} using a graph matching algorithm (Cai & Knight, 2013; Hoang et al., 2021). A visual illustration of the resulting graph G is shown in Figure 2, where n_i and e_j are the candidates for the node and edge predictions collected from the beam sequences. As shown, compared to the sequence-based representation g, the meta graph G (1) explicitly enumerates different candidates for each node and edge prediction (e.g., n_1 vs. n_2 for predicting the first element), and (2) provides an explicit account of the parent-child relationships between variables on the graph (e.g., e_7 is a child node of n_3 in the predicted graph, which is not reflected in the autoregressive representation). From the probabilistic learning perspective, the meta graph G describes the space of possible graphs (i.e., the support) for a graph distribution p(G|x): G → [0, 1]. It describes the possible node and edge variables and their dependencies on the graph G (i.e., the shaded squares in Figure 2), as well as the different possible values for each node and edge variable (i.e., the solid squares within each shaded square in Figure 2). To this end, GAP assigns a proper graph-level probability p(G|x) to graphs G sampled from the meta graph G via the graphical model likelihood:

p(G|x) = \prod_{v \in G} p(v | pa(v), x) = \prod_{n \in N} p(n | pa(n), x) \cdot \prod_{e \in E} p(e | pa(e), x)

where p(v | pa(v), x) is the conditional probability for v with respect to its parents pa(v) in G.
Given the candidate graphs {G_k}_{k=1}^{K}, we can express the likelihood p(v | pa(v), x) by writing down a multinomial likelihood enumerating over the different values of pa(v) (Murphy, 2012). This in fact leads to a simple expression for the model likelihood as an average of the beam-sequence log likelihoods:

log p(n | pa(n), x) ∝ \frac{1}{K} \sum_{k=1}^{K} \log p(n | pa(n) = c_k, x)   (7)

where c_k is the value of pa(n) in the k-th beam sequence, and the conditional probabilities are computed using Equation (5). See Appendix C for a detailed derivation. In summary, for each graph element variable v ∈ G, GAP allows us to compute the graphical-model conditional likelihood p(v | pa(v), x) via its graphical model representation, and the marginal probability p(v|x) via its autoregressive representation. Algorithm 1 summarizes the full GAP computation.
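Equation (7) reduces to averaging per-beam log conditionals, which can be sketched as follows; the conditional probability values are hypothetical placeholders for outputs of Equation (5):

```python
import math

def node_log_likelihood(cond_probs):
    """Eq. 7 sketch: log p(n | pa(n), x) ∝ (1/K) * sum_k log p(n | pa(n)=c_k, x),
    where cond_probs[k] is the conditional under the k-th beam's parent value
    (computed via Eq. 5 in practice; toy values here)."""
    K = len(cond_probs)
    return sum(math.log(p) for p in cond_probs) / K

def graph_log_likelihood(per_element_cond_probs):
    """log p(G|x) = sum over nodes/edges of their beam-averaged log
    conditionals, mirroring the graphical-model factorization."""
    return sum(node_log_likelihood(ps)
               for ps in per_element_cond_probs.values())
```

For example, a node whose parent value is stable across beams contributes a single well-supported log conditional, while disagreement across beams pulls its averaged log likelihood down.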

2.2. EVALUATING COMPOSITIONAL UNCERTAINTY

In this section, we present how to evaluate compositional uncertainty. A common approach to evaluating a model's uncertainty quality is to measure its calibration performance, i.e., whether the model's predictive uncertainty is indicative of the predictive error, e.g., via the expected calibration error (ECE; Naeini et al., 2015).

Algorithm 1 Graph Autoregressive Process (GAP)
Inputs: Beam candidates with probabilities {p(g_k|x)}_{k=1}^{K}, meta graph G
Output: Marginal probabilities {p(v|x)}, graphical model likelihood log p(G|x)
for v ∈ G do
    Compute marginal likelihood p(v = s|x) (Equation 4)
    Compute graphical model likelihood log p(v = s | pa(v), x) (Equation 7)
end for
Return marginal probabilities {p(v|x)} and graphical model likelihood log p(G|x) = \sum_{v \in G} \log p(v | pa(v), x)

In this work, we propose a compositional calibration metric based on ECE, which measures the difference in expectation between the model's predictive confidence (e.g., the maximum probability score) on graph elements (nodes or edges) and their actual match to the gold graph. Formally, at inference time, given an input x and a target graph G, we first partition the confidence interval into B equal bins I_1, ..., I_B. Then, in each bin, we measure the absolute difference between the node/edge accuracy and the confidence of the predictions in that bin. This gives the compositional expected calibration error (CECE_G) for graph G as:

CECE_G = \frac{1}{|G|} \sum_{b=1}^{B} \Big| \sum_{v_t \in \hat{G}, p(v_t|x) \in I_b} \big( C(v_t, G) - p(v_t|x) \big) \Big|

where |G| is the number of graph elements in the target graph G, v_t is the t-th element (node/edge) in the predicted graph \hat{G}, C(v_t, G) denotes whether v_t matches in the gold graph G under a graph matching algorithm, and p(v_t|x) is obtained by GAP as in Section 2.1. Specifically, we use the matching algorithm adopted in SMATCH (Cai & Knight, 2013), which is the same graph matching algorithm used for constructing the meta graph G (see Appendix E for details). Alternatively, we can compute CECE only for node or edge predictions, namely CECE_N and CECE_E.
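A minimal sketch of this binned computation, assuming the per-element confidences from GAP and the binary SMATCH-style match indicators are already available:

```python
def cece(confidences, matches, num_bins=10):
    """Compositional ECE sketch: partition [0, 1] into equal-width bins and
    accumulate, per bin, the absolute gap between summed match indicators
    C(v_t, G) and summed confidences p(v_t|x), normalized by the number of
    graph elements. `confidences[t]` is p(v_t|x) from GAP; `matches[t]` is
    1 if element v_t matches the gold graph, else 0."""
    total = len(confidences)
    error = 0.0
    for b in range(num_bins):
        lo, hi = b / num_bins, (b + 1) / num_bins
        idx = [t for t, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if idx:
            error += abs(sum(matches[t] - confidences[t] for t in idx))
    return error / total
```

A perfectly calibrated parser, whose bin-level accuracy equals its bin-level confidence, drives this quantity to zero.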

3. EXPERIMENTAL EVALUATION

Datasets. In this paper, we take semantic parsing as a canonical application, and build a large benchmark consisting of three semantic parsing tasks covering ten different domains, ranging from graph-based grammar parsing to dialogue-oriented semantic parsing: • Redwoods: The LinGO Redwoods Treebank is a collection of hand-annotated corpora for an English grammar, consisting of more than 20 datasets. The underlying grammar is the English Resource Grammar (ERG; Flickinger et al., 2014; Bender et al., 2015), an open-source, domain-independent, linguistically precise, and broad-coverage grammar. ERG analyses can be rendered in several annotation formalisms. This work focuses on the Elementary Dependency Structure (EDS; Oepen & Lønning, 2006), a compact representation that can be expressed as a DAG. Following previous work, for the in-domain test we train and evaluate models on the subset treebank corresponding to the 25 Wall Street Journal (WSJ) sections with standard data splits (Flickinger et al., 2012). For out-of-domain (OOD) evaluations, we select 7 diverse datasets from Redwoods: Wikipedia (Wiki), the Brown Corpus (Brown), the Eric Raymond Essay (Essay), customer emails (E-commerce), meeting/hotel scheduling (Verbmobil), Norwegian tourism (LOGON), and the Tanaka Corpus (Tanaka) (see Appendix D for more details). • SMCalFlow: SMCalFlow (Andreas et al., 2020) is a large corpus of semantically detailed annotations of task-oriented natural dialogues. The annotation uses dataflow computational graphs, composed of a rich set of both general and application-specific functions, to represent user requests as rich compositional expressions. We use the standard data split in the original paper and evaluate inference results on the development set. We convert all the data into linearized graphs (Appendix B). To reduce the sequence length, since the number of node/edge names is limited, we treat them as atomic symbols that the tokenizer does not split. • SNIPS: SNIPS (Coucke et al., 2018) is a benchmark of crowdsourced voice-assistant queries annotated with intents and slots; we convert the intent/slot annotations into PENMAN-style graphs (Appendix B).
[Table 1: per-domain results for SMATCH (↑), F1_N (↑), F1_E (↑), ECE_seq (↓), CECE_G (↓), CECE_N (↓), and CECE_E (↓), with rows ordered by domain distance starting from WSJ (in-domain).]

Evaluation Metrics. Consistent with previous work, the performance metric used for Redwoods is the SMATCH score (Cai & Knight, 2013), which computes the degree of overlap between two semantic graphs (see Appendix E for details). For SMCalFlow, we use sequence accuracy (exact match). For SNIPS, we use slot F1 score, which is equal to node F1 score when we convert the SNIPS data into PENMAN notation. For model calibration, we report the naive ECE based on sequence accuracy and the compositional calibration metrics (CECE) introduced in Section 2.2.

Seq2seq Model. We adopt T5 (Raffel et al., 2020) as the baseline model, a pre-trained seq2seq Transformer that has been widely used in many NLP applications. We use the open-sourced T5X codebase, a new and improved implementation of T5 in JAX and Flax. Specifically, we use the official pretrained T5-large (770 million parameters) and fine-tune it on the three datasets respectively. Evaluating the performance metrics for each task, we find that the T5 model achieves state-of-the-art results on all tasks compared to previous work (see Table 3 in Appendix F for the full comparison).

3.1. EVALUATING COMPOSITIONAL UNCERTAINTY UNDER DISTRIBUTIONAL SHIFT

Comparing Compositional ECE to Vanilla ECE. Table 1 reports the evaluation results on the benchmark. First, we find that sequence accuracy (ACC_seq) does not necessarily correlate with the SMATCH score, which makes the vanilla ECE based on sequence accuracy, i.e., ECE_seq, less informative in reflecting the model's calibration in predicting graph structures. Second, compared to Redwoods (in-domain), which requires parsing a natural language sentence into a pre-defined grammar representation, and SNIPS, which requires labeling intent slots for a natural language sentence, SMCalFlow is more ambiguous and difficult, as it involves complex dialogue histories and fine-grained intent slots (Andreas et al., 2020). We observe that CECE is larger for SMCalFlow on the in-domain test, indicating that CECE is a better metric for reflecting this task ambiguity/difficulty. Finally, as model generalization degrades across domains for Redwoods, CECE increases accordingly, indicating that CECE can reflect the model's generalization behavior under domain shift.

Comparing Advanced Uncertainty Methods. In recent years, a variety of methods have been developed to improve the uncertainty quality of DNNs on classification problems. Here, we are interested in understanding whether the benefit those advanced methods bring in the classification setting also translates to the graph parsing setting. In this section, we evaluate the performance of six uncertainty baselines on Redwoods across eight different domains. We consider T5-Large (Raffel et al., 2020) as the base model, and select six methods based on their practical applicability for the base model. We find that these methods can improve calibration for node predictions, while for edge predictions, only little improvement is observed. This suggests that uncertainty estimation is structurally different for seq2seq graph parsing tasks compared to classification tasks, and that further research is needed to design better-calibrated models with more focus on compositional uncertainty calibration.

3.2. PRACTICAL EFFECTIVENESS: UNCERTAINTY-GUIDED COLLABORATIVE SEMANTIC PARSING

Motivation. To further explore the correlation between model uncertainty and performance, we plot histograms of the T5 model's probabilities versus the node/edge accuracies in Appendix H (Figure 5), where we find that low model probability generally corresponds to low model performance. This motivates collaborative semantic parsing using compositional uncertainty, where the model is allowed to send a limited number of uncertain subgraphs for human review. This is a practical setting in many realistic scenarios; for example, in Figure 1, the model can ask for clarification regarding the uncertain subgraph (dotted square) by modeling the uncertainty score for each element in the parsed semantic graph. This process allows the system to collaborate with users to avoid triggering unwanted actions, which cannot be achieved without properly quantifying compositional uncertainty over graph elements.

Uncertainty-based Subgraph Selection. For a well-trained seq2seq parser from Section 3.1, we find uncertain subgraphs by (1) selecting a fixed number of uncertain nodes as root nodes based on the ranked compositional uncertainty scores over the predicted graph, and (2) tracing descendants from these root nodes up to depth d.

Results. The results are shown in Figure 4. Due to space limitations, we only report results for d = 2 on Redwoods and SMCalFlow, and d = 1 on SNIPS. Full results for different combinations of example numbers and depths are reported in Table 5 and Table 6 (Appendix I). We see that for all three tasks, uncertainty-based subgraph selection consistently outperforms random subgraph selection, with average error reduction rates of 13.64%, 24.45%, and 52.11% respectively, and performs fairly close to Oracle, with an average difference in prediction error as small as 0.33. This shows that compositional uncertainty is effective in detecting potentially incorrect subgraphs predicted by the model.
Meanwhile, we notice that Oracle* performs better than Oracle, indicating that incorrectly predicted subgraphs with high uncertainty are more informative to the collaborative model.

Analysis. Theoretically, the performance of collaborative parsing is determined by how many incorrectly predicted subgraphs can be selected for human edits, where Oracle is the headroom given a limited budget. In Table 7 (Appendix I), we further analyze the subgraphs selected by different strategies by calculating the coverage rate of incorrect nodes in the selected subgraphs relative to the incorrect nodes in the entire graph (i.e., the error node coverage rate). The results indicate that compositional uncertainty is effective in detecting incorrectly predicted nodes.
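The uncertainty-based selection strategy described above (rank roots by uncertainty, then trace descendants to a fixed depth) can be sketched as follows; the graph structure, uncertainty scores, and parameter names are hypothetical placeholders for the GAP outputs and the paper's budget/depth settings:

```python
from collections import deque

def select_uncertain_subgraphs(nodes, children, uncertainty, budget, depth):
    """Uncertainty-guided subgraph selection sketch: pick `budget` root
    nodes with the highest uncertainty score, then gather each root's
    descendants up to `depth` via breadth-first traversal.
    `children` maps a node to its list of child nodes."""
    roots = sorted(nodes, key=lambda n: uncertainty[n], reverse=True)[:budget]
    subgraphs = []
    for root in roots:
        selected, frontier = {root}, deque([(root, 0)])
        while frontier:
            node, d = frontier.popleft()
            if d == depth:
                continue  # stop expanding past the depth limit
            for child in children.get(node, []):
                if child not in selected:
                    selected.add(child)
                    frontier.append((child, d + 1))
        subgraphs.append(selected)
    return subgraphs
```

The selected subgraphs would then be surfaced for human review, with the remaining graph kept as predicted.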

4. CONCLUSION AND FUTURE WORK

Over the past few years, substantial effort has been devoted to applying seq2seq models to graph parsing, an important area in NLP. Despite achieving state-of-the-art results on various graph parsing tasks, these seq2seq approaches pose a challenge in interpreting the model's predictive uncertainty over graph structures. This work is the first to provide a general method to properly quantify and evaluate compositional uncertainty for seq2seq graph parsing, achieved through a simple probabilistic framework (GAP) and a rigorous metric (CECE). Our experimental evaluation demonstrates that CECE is an effective metric for reflecting distributional shift, and that compositional uncertainty is a useful tool for downstream tasks such as collaborative semantic parsing.

A RELATED WORK

A.1 SEQ2SEQ GRAPH PARSING

Seq2seq graph parsing is inspired by the success of recent seq2seq models (particularly pretrained ones), which are at the heart of modern neural machine translation. This type of parser encodes and views a target graph as a string from another language (Vinyals et al., 2015). However, simply applying seq2seq models to graph parsing is not always successful when the target graph is complicated, e.g., for Abstract Meaning Representation (AMR; Banarescu et al., 2013) or English Resource Grammar (ERG; Flickinger et al., 2014), because effective linearization (encoding graphs as linear sequences) and data sparsity were thought to pose significant challenges (Konstas et al., 2017). Specifically designed preprocessing procedures for vocabulary and entities can help to address these issues (Konstas et al., 2017; Peng et al., 2017), but such procedures are tied to a particular type of meaning representation and are difficult to transfer to others. To address this, Bevilacqua et al. (2021) propose using special tokens to represent variables in the linearized graph and to handle co-referring nodes. Lin et al. (2022b) propose a variable-free top-down linearization and a compositionality-aware tokenization for ERG graph preprocessing, successfully casting ERG parsing as a translation problem that can be solved by a state-of-the-art seq2seq model, T5 (Raffel et al., 2020). Their parser achieves the best known results on the in-domain test set for ERG parsing.

A.2 UNCERTAINTY QUANTIFICATION FOR GRAPH PARSING

Compared to seq2seq graph parsing, uncertainty quantification is straightforward when the parser explicitly models the target graph structure, e.g., chart parsers (Magerman & Marcus, 1991), factorization-based parsers (McDonald, 2006; Cao et al., 2021), or composition-based parsers (Chen et al., 2018; 2019), given that the model's score function is naturally aligned with the graph structure. As for transition-based parsers (Fernandez Astudillo et al., 2020; Zhou et al., 2021), where the target graph is generated via a series of actions in a process very similar to dependency tree parsing (Yamada & Matsumoto, 2003; Nivre, 2008), previous work has used importance sampling to estimate probabilities (Dyer et al., 2016) and modeled uncertainty for alignments between graph nodes and input text tokens (Drozdov et al., 2022). These approaches are specific to the formalism of the target graph and are difficult to transfer to other graph parsing problems. Other uncertainty quantification methods have focused on sequential or token-level uncertainty for seq2seq models. For example, Dong et al. (2018) model uncertainty for neural semantic parsers by outlining three major causes of uncertainty (model uncertainty, data uncertainty, and input uncertainty) and design various metrics to quantify these factors. Lin et al. (2022b) use the predictive probability generated by the T5 model as a signal for neural-symbolic parsing. However, these methods cannot model compositional uncertainty over graph structures.

B PENMAN NOTATION

PENMAN notation, originally called Sentence Plan Notation in the PENMAN project (Kasper, 1989), is a serialization format for the directed, rooted graphs used to encode semantic dependencies, most notably in the Abstract Meaning Representation (AMR) framework (Banarescu et al., 2013). It is similar to Lisp's S-expressions in using parentheses to indicate nested structures. To make PENMAN notation compatible with seq2seq learning, we adopt the variable-free version of PENMAN first proposed by Lin et al. (2022b). Because the linearized form can only describe projective structures such as trees, in order to capture non-projective graphs this notation (1) reverses some of the edges so that the graph can be written in top-down tree order (e.g., :E3-of), and (2) uses star markers to indicate a node that is referred to later, establishing a reentrancy (e.g., E6*). Table 2 shows variable-free PENMAN linearized examples for different semantic parsing tasks.

Redwoods — "The Pentagon foiled the plan."
( foil_v_1 :ARG1 ( named :carg "Pentagon" :BV-of ( the_q ) ) :ARG2 ( plan_n_1 :BV-of ( the_q ) ) )

SMCalflow — User: What time on Tuesday is my planning meeting?
( start :ARG1 ( findEvent :ARG1 ( EventSpec :name "planning" :start ( Timespec :weekday "tuesday" ) ) ) )

SNIPS — Find a movie called Living in America.
( IN:SEARCH_CREATIVE_WORK :ARG1 ( SL:OBJECT_TYPE :carg "movie" ) :ARG2 ( SL:OBJECT_NAME :carg "living in america" ) )

Table 2: Examples of variable-free PENMAN linearized graphs (template can be found in Appendix B) for three different semantic parsing tasks (task details can be found in Section 3). Here :carg marks the corresponding span in the sentence.
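Variable-free PENMAN strings like those in Table 2 can be read back into a nested structure with a small recursive-descent parser. Below is a minimal sketch (function and token conventions are our own, not the paper's implementation; it does not handle the :X-of edge reversal or the * reentrancy markers described above):

```python
import re

def parse_penman(s):
    """Parse a variable-free PENMAN string into (label, children) tuples,
    where children is a list of (role, subtree-or-constant) pairs."""
    # Tokens: parentheses, quoted strings, or runs of non-space characters.
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()]+', s)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]            # the node label, e.g. findEvent
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos].startswith(':'):   # a role such as :ARG1 or :carg
                role = tokens[pos]
                pos += 1
                if tokens[pos] == '(':        # nested node
                    children.append((role, parse_node()))
                else:                         # a constant, e.g. "planning"
                    children.append((role, tokens[pos]))
                    pos += 1
            else:                             # bare token (e.g. a reentrancy marker)
                children.append((None, tokens[pos]))
                pos += 1
        pos += 1                              # consume ')'
        return (label, children)

    return parse_node()
```

The recovered tree is exactly the structure over which node- and edge-level uncertainties are later aggregated.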

C SIMPLIFIED EXPRESSION FOR GRAPHICAL MODEL LIKELIHOOD

Given the candidate graphs $\{G_k\}_{k=1}^{K}$, we can express the likelihood for $p(v \mid \mathrm{pa}(v), x)$ by writing down a multinomial likelihood enumerating over different values of $\mathrm{pa}(v)$ (Murphy, 2012). For example, say $\mathrm{pa}(n) = (e_1, e_2)$, which represents a subgraph of two edges $(e_1, e_2)$ pointing into a node $n$. Then the conditional probability $p(n \mid \mathrm{pa}(n), x)$ can be computed by enumerating over the observed values of the $(e_1, e_2)$ pair:

$$p(n \mid \mathrm{pa}(n), x) = p(n \mid (e_1, e_2), x) \propto \prod_{c \in \mathrm{Candidate}(e_1, e_2)} p(n \mid (e_1, e_2) = c, x)^{K_c} \quad (9)$$

where $\mathrm{Candidate}(e)$ is the collection of possible symbols the variable $e$ can take, and $K_c$ is the number of times $(e_1, e_2)$ takes a particular value $c \in \mathrm{Candidate}(e_1, e_2) = \mathrm{Candidate}(e_1) \times \mathrm{Candidate}(e_2)$. Then the log likelihood becomes:

$$\log p(n \mid \mathrm{pa}(n), x) = \sum_{c} K_c \log p(n \mid (e_1, e_2) = c) \quad (10)$$

To simplify this expression, we notice that $\log p(n \mid \mathrm{pa}(n), x)$ can be divided by the constant beam size $K$ without impacting the inference. As a result, the log probability can be computed by simply averaging the values of $\log p(n \mid (e_1, e_2) = c_k)$ across the beam candidates:

$$\log p(n \mid \mathrm{pa}(n), x) \propto \sum_{c} \frac{K_c}{K} \log p(n \mid (e_1, e_2) = c) = \frac{1}{K} \sum_{k=1}^{K} \log p(n \mid (e_1, e_2) = c_k) \quad (11)$$

where $c_k$ is the value of $(e_1, e_2)$ in the $k$-th beam candidate.
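Equation 11 amounts to a direct average over beam candidates. A minimal sketch, where `cond_logprob` is a stand-in for the seq2seq model's conditional log-probability and `node_logprob_grouped` verifies the equivalence to the $K_c$-weighted form:

```python
from collections import Counter

def node_logprob(beam_parent_values, cond_logprob):
    """Estimate log p(n | pa(n), x) per Eq. 11: average the conditional
    log-probabilities across the K beam candidates."""
    K = len(beam_parent_values)
    return sum(cond_logprob(c) for c in beam_parent_values) / K

def node_logprob_grouped(beam_parent_values, cond_logprob):
    """Equivalent grouped form: weight each observed parent value c by
    its count K_c, normalized by the beam size K."""
    K = len(beam_parent_values)
    counts = Counter(beam_parent_values)
    return sum(kc / K * cond_logprob(c) for c, kc in counts.items())
```

Because duplicated beam candidates simply contribute their count `K_c`, the two forms coincide exactly, which is why no explicit enumeration over `Candidate(e_1, e_2)` is needed at inference time.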

D OOD DATASETS FOR ERG PARSING

Wikipedia (Wiki). The DeepBank team constructed a treebank for 100 Wikipedia articles on Computational Linguistics and closely related topics. The treebank of 11,558 sentences comprises 16 sets of articles. The corpus contains mostly declarative, relatively long sentences, along with some fragments.

The Brown Corpus (Brown). The Brown Corpus was a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources.

The Eric Raymond Essay (Essay). This treebank is based on translations of the essay "The Cathedral and the Bazaar" by Eric Raymond. The average length and the linguistic complexity of these sentences are markedly higher than in the other treebanked corpora.

E-commerce. While the ERG was being used in a commercial software product developed by the YY Software Corporation for automated response to customer emails, a corpus of training and test data was constructed and made freely available, consisting of email messages composed by people pretending to be customers of a fictional consumer products online store. The messages in the corpus fall into four roughly equal-sized categories: Product Availability, Order Status, Order Cancellation, and Product Return.

Meeting/hotel scheduling (Verbmobil). This dataset is a collection of transcriptions of spoken dialogues, each of which reflects a negotiation either to schedule a meeting or to plan a hotel stay. One dialogue usually consists of 20-30 turns, with most of the utterances relatively short, including greetings and closings, and not surprisingly with a high frequency of time and date expressions as well as questions and sentence fragments.

Norwegian tourism (LOGON). The Norwegian/English machine translation research project LOGON acquired for its development and evaluation corpus a set of tourism brochures originally written in Norwegian and then professionally translated into English. The corpus consists almost entirely of declarative sentences and many sentence fragments, where the average number of tokens per item is higher than in the Verbmobil and E-commerce data.

The Tanaka Corpus (Tanaka). This treebank is based on parallel Japanese-English sentences, and was adopted for use in the WWWJDIC dictionary server as a set of example sentences associated with words in the dictionary.

E GRAPH MATCHING ALGORITHM IN SMATCH

In general, finding the largest common subgraph is a well-known computationally intractable problem in graph theory. However, for graph parsing problems where graphs have labels and a simple tree-like structure, efficient heuristics can approximate the best match with a hill-climbing algorithm (Cai & Knight, 2013). The initial match is modified iteratively to maximize the total number of matches within a predefined number of iterations (default value set to 5). This algorithm is efficient and effective; it was also used to calculate the SMATCH score in Cai & Knight (2013).

F COMPARING T5 TO PREVIOUS WORK
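The hill-climbing idea can be sketched as follows. This is a simplified re-implementation under our own conventions, not the exact Smatch code: graphs are lists of (source, role, target) triples, node labels appear as constant targets of `instance` triples, and random restarts play the role of the predefined iteration budget:

```python
import random

def match_score(triples1, triples2, mapping):
    """Count triples of graph 1 that land on a triple of graph 2 under the
    node mapping; constants (labels) pass through unchanged."""
    t2 = set(triples2)
    return sum((mapping.get(s, s), r, mapping.get(t, t)) in t2
               for s, r, t in triples1)

def hill_climb_match(nodes1, nodes2, triples1, triples2, restarts=5, seed=0):
    """Approximate the best node mapping by greedy hill climbing with
    random restarts, in the spirit of Cai & Knight (2013)."""
    rng = random.Random(seed)
    best_map, best_score = None, -1
    for _ in range(restarts):
        mapping = {n: rng.choice(nodes2) for n in nodes1}
        improved = True
        while improved:                       # climb until a local optimum
            improved = False
            for n in nodes1:
                for cand in nodes2:
                    trial = dict(mapping, **{n: cand})
                    if match_score(triples1, triples2, trial) > \
                       match_score(triples1, triples2, mapping):
                        mapping, improved = trial, True
        score = match_score(triples1, triples2, mapping)
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score
```

Each restart converges to a local optimum of the match count; keeping the best over a handful of restarts is usually sufficient for the small, tree-like graphs produced by semantic parsing.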

G FULL RESULTS FOR COMPARING UNCERTAINTY BASELINES

In Table 4, we report the full results for comparing different uncertainty baselines on the benchmark introduced in Section 3.

H HISTOGRAM OF CALIBRATIONS

The correlations between the subgraph's probability and performance on ERG parsing are shown in Figure 5 , where we can see that low model probability generally corresponds to low model performance, i.e., the model is relatively calibrated in predicting graph structures.

I FULL RESULTS FOR COLLABORATIVE SEMANTIC PARSING

In Table 5 and Table 6, we report the full results for collaborative performance under different subgraph budgets, given by different combinations of the number of subgraphs (e) and the max depth of each subgraph (d). Note that for SNIPS, since performance is close to 100% when the number of selected subgraphs is greater than 3 (i.e., the total graph is already covered), we do not evaluate cases where e > 3. We can see from the tables that uncertainty-based subgraph selection consistently outperforms random subgraph selection and performs close to oracle subgraph selection. In Table 7, we further analyze the subgraphs selected by different strategies by calculating the coverage rate of incorrect nodes in the selected subgraphs relative to incorrect nodes in the entire graph (i.e., the error node coverage rate). The results indicate that compositional uncertainty is effective in detecting incorrectly predicted nodes.
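The uncertainty-based selection strategy can be sketched as follows (the helper name and the dict-based graph encoding are our own, not the paper's implementation):

```python
def select_uncertain_subgraphs(node_logprobs, children, e=3, d=2):
    """Pick the e nodes with the lowest (most uncertain) conditional
    log-probability as roots, then collect each root's descendants up to
    depth d.

    node_logprobs: {node: log p(v | pa(v), x)} per the GAP factorization.
    children:      {node: [child, ...]} adjacency of the predicted graph.
    Returns a list of e node sets, one subgraph per root.
    """
    roots = sorted(node_logprobs, key=node_logprobs.get)[:e]
    subgraphs = []
    for r in roots:
        frontier, nodes = [r], {r}
        for _ in range(d):                      # breadth-first to depth d
            frontier = [c for n in frontier for c in children.get(n, [])]
            nodes.update(frontier)
        subgraphs.append(nodes)
    return subgraphs
```

The selected node sets are what would be sent for human review; the random and oracle baselines differ only in how the e roots are chosen.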

J LIMITATIONS

Here we discuss some potential limitations of the current study:

Linguistic Breadth. The GAP model in this work is a general uncertainty quantification framework for graph parsing problems using seq2seq models, and theoretically it places no restriction on the formalism or language adopted for the output graph. Other graph formalisms and languages have also been studied (Navigli et al., 2022), and it is interesting to explore the model's behavior on these formalisms in terms of compositional uncertainty.

Graphical Model Specification

The GAP model presented in this work considers a classical graphical model likelihood $p(G \mid x) = \prod_{v \in G} p(v \mid \mathrm{pa}(v), x)$, which leads to a clean factorization between graph elements $v$ and fast probability computation. However, it also assumes a local Markov property: $v$ is conditionally independent of its ancestors given its parent $\mathrm{pa}(v)$. In theory, the probability learned by a seq2seq model is capable of modeling higher-order conditionals between arbitrary elements of the graph. Therefore, it is interesting to ask whether a more sophisticated graphical model with higher-order dependency structure can lead to better performance in practice while maintaining reasonable computational complexity.
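Under this factorization, the graph log-likelihood is just a sum of per-node conditional terms. A minimal sketch, with `cond_logprob` standing in for the model's estimate of $\log p(v \mid \mathrm{pa}(v), x)$:

```python
def graph_log_likelihood(node_parents, cond_logprob):
    """log p(G | x) = sum over nodes v of log p(v | pa(v), x), under the
    local Markov assumption described above.

    node_parents: {node: pa(node)} mapping each node to its parent subgraph.
    cond_logprob: callable (v, pa) -> estimated log p(v | pa(v), x).
    """
    return sum(cond_logprob(v, pa) for v, pa in node_parents.items())
```

A higher-order variant would simply pass a larger conditioning context than `pa(v)` to each term, at the cost of more expensive probability estimation per node.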



Here the number of subgraphs e and the maximum depth d define our review capacity for uncertain subgraphs (in experiments, we try e ∈ {1, 3, 5} and d ∈ {1, 2, 3}). The model uses the same setting as in Section 3.1. Here we take gold subgraphs as the human edit results.



Figure 1: An example for graph semantic parsing in a dialogue system. The output of the model is a linearized semantic graph (left) that corresponds to the graph structure (right). The dotted square indicates the uncertain part, based on which the model can ask a clarification question.

Figure 3: Evaluation results for different uncertainty baselines in terms of SMATCH score, ECE_seq, CECE_G, CECE_N, and CECE_E under distributional shift.

(4) Batch Ensemble (BE), an ensemble method with much lower computational and memory costs compared to MC Dropout and Deep Ensemble (Wen et al., 2019); (5) Spectral-normalized Neural Gaussian Process (SNGP), a recent state-of-the-art approach which improves uncertainty quality by transforming a neural network into an approximate Gaussian process model (Liu et al., 2020); (6) SNGP+DE, the deep ensemble of 4 individual SNGP models; (7) SNGP+BE, which uses a combination of Batch Ensemble and SNGP layers. The results are shown in Figure 3. The full evaluation results can be found in Appendix G (Table 4).

Figure 5: Diagrams of the T5 model's probabilities versus the T5 model's accuracies at the subgraph level (nodes and edges). Each bin contains the same number of examples. Since the model is highly confident on most subgraphs (log P > -1e-5), we exclude these near-certain predictions from the figures.

To this end, GAP assigns probability to each linearized graph $g = s_1 s_2 \cdots s_L$ autoregressively as $p(g \mid x) = \prod_{i=1}^{L} p(s_i \mid s_{<i}, x)$, and the conditional probability $p(s_i \mid s_{<i}, x)$ is computed by aggregating the token probabilities: $p(s_i \mid s_{<i}, x) = p(\{y_{i1} \cdots y_{iN_i}\} \mid s_{<i}, x)$. The joint probability of a node and its incoming edge is obtained by marginalizing over the preceding context: $p(s_i, s_{i \to j} \mid x) = \int p(s_{i \to j} \mid s_i, s_{<i}, x)\, p(s_i \mid s_{<i}, x)\, p(s_{<i} \mid x)\, ds_{<i}$
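The token-to-symbol aggregation can be sketched with a greedy alignment between subword tokens and linearized-graph symbols. This helper is our own simplification (it assumes each symbol is an exact concatenation of consecutive tokens; real subword vocabularies may require detokenization first):

```python
def aggregate_symbol_logprobs(tokens, token_logprobs, symbols):
    """Align subword tokens y_i1..y_iN_i to each graph symbol s_i and sum
    their log-probabilities, giving log p(s_i | s_{<i}, x) per symbol."""
    out, i = [], 0
    for sym in symbols:
        lp, built = 0.0, ''
        while built != sym:            # consume tokens until they spell sym
            built += tokens[i]
            lp += token_logprobs[i]
            i += 1
        out.append(lp)
    return out
```

Summing log-probabilities of the constituent tokens is exactly the chain-rule aggregation $p(s_i \mid s_{<i}, x) = \prod_j p(y_{ij} \mid y_{i,<j}, s_{<i}, x)$ in log space.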



Evaluation results on three tasks. Acc_seq means sequence accuracy (exact match); F1_N/F1_E are F1 scores for nodes/edges; CECE_G/CECE_N/CECE_E denote compositional ECE for the graph/nodes/edges. Since we use pseudo edges to convert SNIPS examples to graphs, we skip edge-related evaluations for SNIPS. The background color in the calibration metrics indicates the rank order within the column (Green: Min; Red: Max).

SNIPS is treated as a slot labeling problem. Following previous work (Yu et al., 2021), we train models on five source domains, use a sixth for development, and test on the remaining domain.

Results. As shown in the figure, these uncertainty baselines generally follow the same pattern under domain shift, i.e., a decrease in SMATCH corresponds to an increase in CECE_G, while we cannot infer distributional shift via ECE_seq. Some uncertainty baselines (e.g., MC Dropout and SNGP+DE) achieve better results in both ECE_seq and CECE_G compared to the deterministic model across different domains, where MC Dropout achieves the best results in ECE_seq and SNGP+DE achieves the best results in CECE_G. By further evaluating CECE_N and CECE_E, we notice that the improvement in CECE_G mainly comes from node predictions (the difference in CECE_N is more obvious than in CECE_E).

Figure 4: Collaborative performance for different semantic parsing tasks. The performance metrics are SMATCH score for Redwoods, sequence accuracy for SMCalflow, and slot (node) F1 for SNIPS.

Baselines for Comparison. We consider three other subgraph selection strategies: (1) Random subgraph selection, which randomly picks e nodes as root nodes and traces descendants from the roots up to depth d; (2) Oracle subgraph selection, which picks e incorrectly predicted nodes as root nodes and traces descendants from the roots up to depth d; (3) Oracle* subgraph selection, which picks e incorrectly predicted nodes (prioritizing the most uncertain nodes) and traces descendants from the roots up to depth d.

Training and Inference. The training examples for the collaborative model are generated by attaching human edit results for random subgraphs to the input sentences. During inference, the test examples are generated by attaching the corresponding human edit results for each subgraph selection strategy to the input sentences.


The table shows the in-domain performance of the T5 model compared to previous work on Redwoods, SMCalflow and SNIPS. The results indicate that the T5 model is capable of achieving state-of-the-art results on all tasks compared to previous work. Interestingly, we notice that by rewriting SMCalflow into PENMAN notation, the sequence accuracy increases from 72.9 to 82.8 on the development set, indicating that proper linearization and tokenization are important for graph parsing tasks.

In-domain performance evaluation for three semantic parsing tasks. The T5 model is built based on Lin et al. (2022b) for Redwoods, and it surprisingly achieves state-of-the-art results on the other datasets (SMCalflow and SNIPS) as well.

In this work, we have tested GAP on Redwoods, SMCalflow and SNIPS, which are all English-based, but it is worth seeing how the approach generalizes to other languages.

Evaluations for different uncertainty baselines on different domains in the Redwoods treebanks, reporting ACC_seq (↑), SMATCH (↑), F1_N (↑), F1_E (↑), ECE_seq (↓), CECE_G (↓), CECE_N (↓), and CECE_E (↓). BE refers to Batch Ensemble, and DE refers to Deep Ensemble.



Full collaborative parsing performance on SMCalflow and SNIPS; the performance metrics are sequence accuracy for SMCalflow and slot (node) F1 for SNIPS.

ACKNOWLEDGMENTS

We appreciate the insightful comments from the anonymous reviewers. We would like to thank Jie Ren for discussion and proofreading.

Availability: https://github

