CHEMISTRYQA: A COMPLEX QUESTION ANSWERING DATASET FROM CHEMISTRY

Abstract

Many Question Answering (QA) tasks have been studied in NLP and employed to evaluate the progress of machine intelligence. One kind of QA task, such as Machine Reading Comprehension QA, is well solved by end-to-end neural networks; another kind, such as Knowledge Base QA, needs to be translated to a formal representation and then solved by a well-designed solver. We notice that some real-world QA tasks are more complex: they can neither be solved by end-to-end neural networks nor translated to any kind of formal representation. To further stimulate research on QA and the development of QA techniques, in this work we create a new and complex QA dataset, ChemistryQA, based on real-world chemical calculation questions. To answer chemical questions, machines need to understand the question, apply chemistry and math knowledge, and perform calculation and reasoning. To help researchers ramp up, we build two baselines: the first is a BERT-based sequence-to-sequence model, and the second is an extraction system plus a graph-search-based solver. These two methods achieve 0.164 and 0.169 accuracy on the development set, respectively, which clearly demonstrates that new techniques are needed for complex QA tasks. The ChemistryQA dataset will be available for public download once the paper is published.

1. INTRODUCTION

Recent years have witnessed huge advances on the question answering (QA) task, and some AI agents even beat human beings. For example, IBM Watson won Jeopardy by answering questions which require a broad range of knowledge (Ferrucci, 2012). Transformer-based neural models, e.g., XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), beat human beings on both machine reading comprehension and conversational QA tasks. The Aristo system (Clark et al., 2019) gets an 'Ace' on an eighth-grade science examination and is able to answer 80 percent of 12th-grade science test questions correctly. Most solutions to the QA task fall into two categories: end-to-end solutions and parsing plus execution. The former predicts answers with an end-to-end neural network, e.g., reading comprehension QA (Rajpurkar et al., 2016; 2018; Lai et al., 2017) and science exam QA (Clark et al., 2019; 2018). The latter translates a question into a specific structural form which is then executed to get the answer. For example, in knowledge-based question answering (KBQA) (Berant et al., 2013; Yih et al., 2016; Saha et al., 2018), questions are parsed into SPARQL-like queries consisting of predicates, entities and operators. In Math Word Problems (Huang et al., 2016; Amini et al., 2019), questions are translated to stacks of math operators and quantities. However, in the real world, many QA tasks cannot be solved by end-to-end neural networks, and it is also very difficult to translate questions into any kind of formal representation. Solving chemical calculation problems is such an example. Chemical calculation problems cannot be solved by end-to-end neural networks since complex symbolic calculations are required. It is also difficult to translate such problems into formal representations, since not all operators in the solving process occur in the question stem, which makes it difficult to annotate data and train models. Table 1 shows a question in ChemistryQA.
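To make the "parsing plus execution" category concrete, a Math Word Problem representation such as a stack of operators and quantities can be executed directly once parsing is done. The toy question and its postfix form below are our own illustration, not taken from any of the cited datasets:

```python
# Minimal sketch: executing a Math-Word-Problem-style operator stack.
# The example question and its postfix encoding are illustrative
# assumptions, not drawn from any cited dataset.
def execute_postfix(tokens):
    """Evaluate a postfix sequence of quantities and math operators."""
    stack = []
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    for tok in tokens:
        if tok in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[0]

# "John has 3 apples and buys 4 bags of 2 apples each. How many apples?"
# parsed as: 3 4 2 * +
print(execute_postfix(["3", "4", "2", "*", "+"]))  # -> 11.0
```

The key point is that once a question is parsed into such a form, answering it reduces to deterministic execution; the difficulty for chemical calculation problems, as argued above, is that the required operators often never appear in the question stem.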
To answer the question in Table 1, machines need to: 1) understand the question and extract the variable to be solved and the conditions given in the question; 2) retrieve and apply related chemistry knowledge, including calculating molarity from mole and volume, balancing a chemical equation, and calculating the equilibrium constant K, although there is no explicit statement about "calculating molarity" or "balancing equations" in the question. The combination of these capabilities is scarcely evaluated well by existing QA datasets. To foster research in this area, we create a dataset of chemical calculation problems, namely ChemistryQA.

We collect about 4,500 chemical calculation problems from https://socratic.org/chemistry, covering more than 200 topics in chemistry. Besides the correct answer, we also label the target variable and the conditions provided in each question. Such additional labels facilitate potential data augmentation and inferring gold solving processes for training. To verify that the dataset serves its purpose of evaluating AI's comprehensive capabilities, and to help other researchers ramp up, we build two baselines. a) We build a BERT-based sequence-to-sequence model, which takes the raw question as input and produces the answer as output. This first baseline achieves 0.164 precision on ChemistryQA. b) We create an extraction system which extracts the target variable and conditions from raw questions. The extracted structured information is fed into a graph-search-based solver, which performs a sequence of calculation and reasoning steps to get the final answer. This second baseline achieves 0.169 precision on ChemistryQA.

In summary, the contributions of this paper are as follows.

• We propose a new QA task, ChemistryQA, which requires open knowledge and complex solving processes.
ChemistryQA is different from other existing QA tasks and cannot be solved well by existing QA methods.

• We create the ChemistryQA dataset, which contains about 4,500 chemical calculation problems and covers more than 200 topics in chemistry. We provide a novel annotation scheme for questions, which labels only the asked variable and the conditions from the question stem, but not the solving process. This annotation is much easier and costs less effort, and, as a weakly supervised dataset, it leaves researchers the flexibility to explore various solutions.

• We build two baselines to show that: a) end-to-end neural networks cannot solve this task well; b) the annotation we provide can be used to improve a simple graph-search-based solver.

2. CHEMISTRYQA DATASET

2.1. DATA COLLECTION

We collect chemical calculation problems from https://socratic.org/chemistry. On this website, there are more than 30,000 questions covering about 225 chemistry-related topics, e.g., Decomposition Reactions, Ideal Gas Law and The Periodic Table. An example of the annotation page is shown in Appendix A. Figure 2.A shows the original page in Socratic, which contains a raw question,



Table 1: An Example in ChemistryQA.

Question: At a particular temperature a 2.00 L flask at equilibrium contains 2.80 × 10^-4 mol N2, 2.50 × 10^-5 mol O2, and 2.00 × 10^-2 mol N2O. How would you calculate K at this temperature for the following reaction: N2(g) + O2(g) → N2O(g)?

Conditions: Volume of flask is 2.00 L. Mole of N2 is 2.80 × 10^-4 mol. Mole of O2 is 2.50 × 10^-5 mol. Mole of N2O is 2.00 × 10^-2 mol. Reaction equation is N2(g) + O2(g) → N2O(g).

Knowledge: K = [N2O]^a / ([N2]^b [O2]^c), where [*] is the molarity of *, and a, b, c are the coefficients of the matters.
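As a sketch of the solving process this example implies, the mole conditions can be converted to molarities and plugged into the equilibrium-constant formula. The balanced coefficients (a = 2, b = 2, c = 1, from balancing the equation as 2N2 + O2 → 2N2O) are our own working, not part of the annotation:

```python
# Worked sketch for Table 1's question, assuming the balanced form
# 2 N2(g) + O2(g) -> 2 N2O(g).
volume_l = 2.00                      # flask volume in liters
moles = {"N2": 2.80e-4, "O2": 2.50e-5, "N2O": 2.00e-2}

# Step 1: molarity = mole / volume (knowledge not stated in the question).
conc = {m: n / volume_l for m, n in moles.items()}

# Step 2: K = [N2O]^a / ([N2]^b [O2]^c) with a = 2, b = 2, c = 1.
k = conc["N2O"] ** 2 / (conc["N2"] ** 2 * conc["O2"])
print(f"K = {k:.2e}")                # -> K = 4.08e+08
```

Note that neither step is stated explicitly in the question stem, which is exactly why an end-to-end model struggles with such problems.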


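To illustrate the idea behind the second baseline described above, one can treat variables as graph nodes and chemistry formulas as edges, then search forward from the extracted conditions until the target variable becomes derivable. The variable names and the two formula types below are our own assumptions, not the paper's actual implementation:

```python
# Hedged sketch of a graph-search-based solver: forward-chain over
# formulas (edges) from known conditions (nodes) to the target variable.
# Variable names and formulas are illustrative assumptions.

# Each formula: (required input variables, output variable, function).
FORMULAS = [
    ({"mole_N2", "volume"}, "molarity_N2",
     lambda v: v["mole_N2"] / v["volume"]),
    ({"mole_O2", "volume"}, "molarity_O2",
     lambda v: v["mole_O2"] / v["volume"]),
    ({"mole_N2O", "volume"}, "molarity_N2O",
     lambda v: v["mole_N2O"] / v["volume"]),
    # Equilibrium constant for 2 N2 + O2 -> 2 N2O (our balancing).
    ({"molarity_N2", "molarity_O2", "molarity_N2O"}, "K",
     lambda v: v["molarity_N2O"] ** 2
               / (v["molarity_N2"] ** 2 * v["molarity_O2"])),
]

def solve(conditions, target):
    """Apply formulas until the target variable is computed or no
    formula can fire; returns None if the target is unreachable."""
    known = dict(conditions)
    changed = True
    while target not in known and changed:
        changed = False
        for inputs, output, fn in FORMULAS:
            if output not in known and inputs <= known.keys():
                known[output] = fn(known)
                changed = True
    return known.get(target)

# Conditions extracted from the question in Table 1.
conditions = {"volume": 2.00, "mole_N2": 2.80e-4,
              "mole_O2": 2.50e-5, "mole_N2O": 2.00e-2}
print(f"K = {solve(conditions, 'K'):.2e}")  # -> K = 4.08e+08
```

Under this view, the annotation of target variable and conditions is exactly the input the solver needs, which is why it can improve over an end-to-end model without any annotated solving process.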