CHEMISTRYQA: A COMPLEX QUESTION ANSWERING DATASET FROM CHEMISTRY

Abstract

Many Question Answering (QA) tasks have been studied in NLP and used to evaluate the progress of machine intelligence. One kind of QA task, such as machine reading comprehension QA, is well solved by end-to-end neural networks; another kind, such as knowledge base QA, needs to be translated into a formal representation and then solved by a well-designed solver. We notice that some real-world QA tasks are more complex and can neither be solved by end-to-end neural networks nor translated into any kind of formal representation. To further stimulate research on QA and the development of QA techniques, in this work we create a new and complex QA dataset, ChemistryQA, based on real-world chemical calculation questions. To answer chemical questions, machines need to understand the questions, apply chemistry and math knowledge, and perform calculation and reasoning. To help researchers ramp up, we build two baselines: the first is a BERT-based sequence-to-sequence model, and the second is an extraction system plus a graph-search-based solver. These two methods achieve 0.164 and 0.169 accuracy on the development set, respectively, which clearly demonstrates that new techniques are needed for such complex QA tasks. The ChemistryQA dataset will be made available for public download once the paper is published.

1. INTRODUCTION

Recent years have witnessed huge advances in question answering (QA), and some AI agents have even surpassed human beings. For example, IBM Watson won Jeopardy! by answering questions that require a broad range of knowledge (Ferrucci, 2012). Transformer-based neural models, e.g., XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), beat human performance on both machine reading comprehension and conversational QA tasks. The Aristo system (Clark et al., 2019) aced an eighth-grade science examination and answered 80 percent of the questions correctly on a 12th-grade science test. Most solutions to QA tasks fall into two categories: end-to-end solving and parsing plus execution. The former predicts answers with an end-to-end neural network, e.g., reading comprehension QA (Rajpurkar et al., 2016; 2018; Lai et al., 2017) and science exam QA (Clark et al., 2019; 2018). The latter translates a question into a specific structural form which is then executed to obtain the answer. For example, in knowledge-based question answering (KBQA) (Berant et al., 2013; Yih et al., 2016; Saha et al., 2018), questions are parsed into SPARQL-like queries consisting of predicates, entities and operators. In Math Word Problems (Huang et al., 2016; Amini et al., 2019), questions are translated into stacks of math operators and quantities. However, in the real world, many QA tasks can neither be solved by end-to-end neural networks nor easily translated into any kind of formal representation. Solving chemical calculation problems is such an example. Chemical calculation problems cannot be solved by end-to-end neural networks because complex symbolic calculations are required. They are also difficult to translate into formal representations, because not all operators in the solving process occur in the question stem, which makes it difficult to annotate data and train models. Table 1 shows a question in ChemistryQA.
To answer the question in Table 1, machines need to: 1) understand the question and extract the variable to be solved and the conditions given in the question; 2) retrieve and apply related chemistry knowledge, including calculating molarity from moles and volume, balancing a chemical equation and calculating the equilibrium constant K, although there is no explicit statement

