DECAF: JOINT DECODING OF ANSWERS AND LOGICAL FORMS FOR QUESTION ANSWERING OVER KNOWLEDGE BASES

Abstract

Question answering over knowledge bases (KBs) aims to answer natural language questions with factual information such as entities and relations in KBs. Previous methods either generate logical forms that can be executed over KBs to obtain final answers, or predict answers directly. Empirical results show that the former often produces more accurate answers, but it suffers from non-execution issues due to potential syntactic and semantic errors in the generated logical forms. In this work, we propose a novel framework, DECAF, that jointly generates both logical forms and direct answers, and then combines their merits to obtain the final answers. Moreover, unlike most previous methods, DECAF is based on simple free-text retrieval without relying on any entity linking tools; this simplification eases its adaptation to different datasets. DECAF achieves new state-of-the-art accuracy on the WebQSP, FreebaseQA, and GrailQA benchmarks, while obtaining competitive results on the ComplexWebQuestions benchmark.¹

1. INTRODUCTION

Knowledge Base Question Answering (KBQA) aims to answer natural language questions based on knowledge from KBs such as DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008), or Wikidata (Vrandečić & Krötzsch, 2014). Existing methods can be divided into two categories. One category is based on semantic parsing, where models first parse the input question into a logical form (e.g., SPARQL (Prud'hommeaux, 2011) or S-expression (Gu et al., 2021)) and then execute the logical form against the knowledge base to obtain the final answers (Das et al., 2021; Gu et al., 2021; Ye et al., 2022). The other category of methods directly outputs answers without relying on a logical-form executor (Lan et al., 2019; Sun et al., 2019; Saxena et al., 2022; Oguz et al., 2022). They either classify the entities in the KB to decide which are the answers (Sun et al., 2019) or generate the answers using a sequence-to-sequence framework (Saxena et al., 2022; Oguz et al., 2022). Previous empirical results (Ye et al., 2022; Das et al., 2021; Gu et al., 2022) show that semantic parsing based methods produce more accurate answers on benchmark datasets. However, due to syntactic and semantic restrictions, the output logical forms can often be non-executable and thus produce no answers. On the other hand, direct-answer-prediction methods are guaranteed to produce answers, although their answer accuracy is usually not as good as that of semantic parsing based methods, especially on complex questions that require multi-hop reasoning (Talmor & Berant, 2018). To our knowledge, none of the previous studies have leveraged the advantages of both types of methods. Moreover, since knowledge bases are usually large-scale with millions of entities, most previous methods rely on entity linking to select relevant information from the KB for answering questions.
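The non-execution failure mode of the semantic parsing paradigm can be illustrated with a minimal sketch. The triple store, logical-form format, and executor below are hypothetical simplifications (a real system parses into SPARQL or S-expressions and queries a full KB such as Freebase), but they show how a syntactically or semantically invalid form yields no answer at all:

```python
# Toy triple store of (subject, relation, object) facts; contents are hypothetical.
KB = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "official_language", "French"),
]

def execute(logical_form, kb):
    """Execute a one-hop S-expression-like form ("JOIN", relation, object).

    Returns the set of subjects linked to `object` via `relation`, or None
    when execution fails: a malformed form (syntactic error) or one that
    matches no triples (semantic error) produces no answers.
    """
    if not (isinstance(logical_form, tuple) and len(logical_form) == 3
            and logical_form[0] == "JOIN"):
        return None  # syntactic error: form is not executable
    _, rel, obj = logical_form
    result = {s for (s, r, o) in kb if r == rel and o == obj}
    return result or None  # semantic error: no matching triples

print(execute(("JOIN", "capital_of", "France"), KB))  # -> {'Paris'}
print(execute(("LOOKUP", "capital_of"), KB))          # -> None (non-executable)
```

A direct-answer-prediction model, by contrast, always emits some answer string, which is exactly the trade-off the joint decoding in DECAF is designed to exploit.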
However, these entity linking methods are usually designed for specific datasets, which inevitably limits their generalization ability. In this paper, we propose a novel framework, DECAF, to overcome these limitations: (1) Instead of relying only on either logical forms or direct answers, DECAF jointly decodes them together, and combines the answers obtained by executing logical forms with the directly generated ones to produce the final answers. Thus the advantages of both methods can be leveraged in our model. Moreover, unlike previous methods that use constrained decoding (Chen et al., 2021a) or post-revision (Das et al., 2021) to produce more faithful logical forms, we simply treat logical forms as regular text strings, just like answers, during generation, reducing the need for hand-crafted engineering. (2) Different from previous methods, which rely on entity linking (Yih et al., 2015; Li et al., 2020) to locate entities appearing in questions and then retrieve relevant information from the KB, DECAF linearizes KBs into text documents and leverages free-text retrieval methods to locate relevant subgraphs. This simplification gives DECAF better adaptability to different datasets, and potentially different KBs, owing to the universal nature of text-based retrieval. Experiments show that simple BM25 retrieval brings surprisingly good performance across multiple datasets.

We conduct experiments on four benchmark datasets: WebQSP (Yih et al., 2016), ComplexWebQuestions (Talmor & Berant, 2018), FreebaseQA (Jiang et al., 2019), and GrailQA (Gu et al., 2021). Experimental results show that our model achieves new state-of-the-art results on the WebQSP, FreebaseQA, and GrailQA benchmarks, and obtains very competitive results on the ComplexWebQuestions benchmark. This demonstrates the effectiveness of DECAF across different datasets and question categories.
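The linearize-then-retrieve idea can be sketched in a few lines: each KB triple becomes a short text document, and the question is matched against these documents with BM25. The mini-KB and the hand-rolled BM25 implementation below are illustrative assumptions, not DECAF's actual pipeline, which retrieves over a full linearized Freebase with an off-the-shelf BM25 index:

```python
import math
from collections import Counter

# Hypothetical mini-KB; real KBs contain millions of triples.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Paris", "population", "2.1 million"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "official_language", "French"),
]

def linearize(triples):
    """Turn each (subject, relation, object) triple into a flat text document."""
    return [f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with standard Okapi BM25."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    n = len(docs)
    df = Counter(w for t in toks for w in set(t))  # document frequencies
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = linearize(TRIPLES)
scores = bm25_scores("what is the capital of France", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # -> Paris capital of France
```

Because retrieval operates purely on text, the same code applies unchanged to any KB that can be linearized, which is the source of the cross-dataset adaptability claimed above.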

2. RELATED WORK

Semantic parsing based methods for KBQA first parse the input question into a logical form (LF) and then execute it against the KB to obtain the answers. ReTrack (Chen et al., 2021a) uses a grammar-based decoder to generate LFs based on pre-defined grammar rules, and a semantic checker to discourage generating programs that are semantically inconsistent with the KB. RnG-KBQA (Ye et al., 2022) first enumerates possible LFs based on the entities in the input question; a ranking-and-generation framework is then applied to output the final LF. ArcaneQA (Gu & Su, 2022) generates LFs dynamically based on the execution results of LFs generated at intermediate steps. TIARA (Shu et al., 2022) proposes a multi-grained retrieval method to select relevant KB context for logical form generation. All these methods rely on an external executor, such as a SPARQL server, to execute LFs for the final answers; if an LF is not executable, no answers are produced.

Direct-answer-prediction methods for KBQA output answers directly without relying on an LF executor. PullNet (Sun et al., 2019) retrieves a subgraph of the KB related to the input question and applies graph neural networks to predict the answer entities in the subgraph. KGT5 (Saxena et al., 2022) uses a sequence-to-sequence framework to generate answers based only on the input question. UniK-QA (Oguz et al., 2022) is also based on a sequence-to-sequence framework, but it first retrieves relevant triplets from the KB and then generates answers based on the combination of the input question and the retrieved triplets. Although answers can always be produced without the need for an LF executor, this type of method usually underperforms semantic parsing based methods on public benchmarks (Talmor & Berant, 2018; Gu et al., 2021; 2022).

Entity Linking & Knowledge Linearization. Real-world KBs are usually very large, with millions of entities and triplets, so the algorithm that grounds the input question onto a relevant subgraph of the KB is important. Entity linking is the most common approach. CBR-KBQA (Das et al., 2021) combines an off-the-shelf model, ELQ (Li et al., 2020), with an NER system provided by the Google Cloud API for entity linking. RnG-KBQA (Ye et al., 2022) also uses ELQ for the WebQSP dataset, while it uses a BERT-based (Devlin et al., 2019) NER system and trains another BERT-based entity disambiguation model for the GrailQA dataset. Previous works usually design different methods to optimize performance on different datasets, and a recent study (Soliman et al., 2022) also shows that entity linking models are usually domain-specific and hard to transfer across domains.

Different from these methods, DECAF reduces this burden by linearizing KBs into text documents and leveraging simple text-retrieval methods. Experimental results show that this is not only more general but also empirically effective. Similar to our method, UniK-QA (Oguz et al., 2022) also linearizes the KB and conducts retrieval. However, UniK-QA still requires entity linking (Yih et al., 2015) to reduce the retrieval range, and it only generates direct answers without studying logical forms for questions requiring complex reasoning. Besides the studies above, UnifiedSKG (Xie et al., 2022) is relevant since it studies the generation of logical forms and direct answers for KBQA. However, it does not study combining the advantages of both logical forms and direct answers, and it further assumes that the ground-truth question entities are provided, which dramatically eases this task.

* Work done during internship at AWS AI Labs

¹ Our code is available at https://github.

