DISAMBIGUATING SYMBOLIC EXPRESSIONS IN INFORMAL DOCUMENTS

Abstract

We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files - that is, determining their precise semantics and abstract syntax tree - as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. We evaluate our model using a plurality of dedicated techniques, taking the syntax and semantics of symbolic expressions into account.

1. INTRODUCTION

Despite huge advancements in machine learning, the task of understanding informal reasoning is still beyond current methods. In fact, it has become commonplace for humans to annotate informal documents containing reasoning in many domains, e.g. law (Libal & Steen, 2020). Reasoning is most visible in mathematical documents and software specifications, and accordingly, over the last decades the formalization of mathematical knowledge and the verification of formal proofs have become increasingly popular. By now, dozens of interactive and automated theorem prover systems are available, each providing libraries with up to hundreds of thousands of formalizations of mathematical definitions, theorems, and their proofs written by human mathematicians (Harrison et al., 2014).

While formal methods are still primarily used by computer scientists (e.g. to verify software and hardware, as well as in program synthesis), they have by now also drawn the interest of an increasing number of research mathematicians - primarily thanks to famous problems such as Kepler's conjecture (Hales et al., 2017) or the classification theorem for finite simple groups (Solomon, 1995), both of which have been successfully verified using theorem prover systems. However, while some mathematicians have begun actively adopting formal methods for their work, there is a prohibitively large discrepancy between the way new mathematical results are developed, presented, and published in mathematical practice, and the way they are formalized and implemented in formal systems (Kaliszyk & Rabe, 2020): Most theorem proving systems implement a fixed logical foundation (such as variants of set theory or various kinds of type theories), a surface syntax in which a user declares new definitions and statements in terms of the underlying foundation, and either a tactic language or a language for expressing proof terms (usually on the basis of the Curry-Howard correspondence in a typed λ-calculus) that allow for declaring proofs.
Consequently, the process of formalizing new content in a formal system resembles programming much more than it does developing informal proofs. This discrepancy results in severe challenges for traditional mathematicians: formal systems are difficult to learn and use, even for those well acquainted with the (informal) mathematics involved. They require learning dedicated formal languages resembling programming languages, declaring content at a level of detail that is prohibitive for beginners even for "obvious" conclusions, and their libraries are difficult to grasp without prior familiarity with the system's language, conventions, and functionality. Due to the required level of detail, knowledge of the existing libraries is crucial when formalizing new content. Furthermore, many "intuitively valid" arguments cannot easily be expressed in terms of a logical foundation in the first place, and knowing how to deal with them requires familiarity with the logical foundation involved and lots of practice. Consequently, the utility of formalizing mathematical results is too easily (and too often) dismissed in light of the additional time and work required for non-experts.

This is despite the fact that many services available for formal mathematics are already enabled by semi-formal (or flexiformal) representations, such as semantic annotations in natural-language texts, or formal representations containing opaque informal expressions (see e.g. Kohlhase (2013); Lange (2011a); Iancu (2017); Kohlhase et al. (2017a); Corneli & Schubotz (2017); Dehaye et al. (2016)). Therefore, we need to invest in methods for bridging the gap between informal mathematical practice and (semi-)formal mathematics. One way to do so is to investigate autoformalization, the task of (semi-automatically) converting existing informal mathematical presentations to (increasingly) formal representations.
Notably, these issues extend beyond pure mathematics to other STEM (science, technology, engineering, and math) fields, where the formal verification (or lack thereof) of results can have direct real-world implications - examples include an infamous and costly error in the floating-point unit of Intel processors (Harrison, 2003) and several human failures to adequately convert between SI and imperial units, most famously in NASA's Mars Climate Orbiter (Grossman). In fact, the former incident has already established formal verification as a vital tool in hardware design (Harrison, 2003). Two observations motivate the research presented here:

1. The vast majority of STEM researchers can be assumed to be comfortable with using LaTeX; any integration of formal methods into a LaTeX development environment (e.g. via new packages or IDE integration) would consequently lower the entry barrier significantly.
2. The task of going from purely informal mathematical texts to fully formal representations of the contained knowledge is best approached via a separation of concerns, by focussing on individual subtasks (such as disambiguating symbolic expressions, parsing natural language, and translating it to a formal foundation) using dedicated tools for each.

In this paper, we specifically discuss the task of disambiguating symbolic expressions - i.e. associating all symbols in an expression with their precise semantics - in LaTeX documents as a machine learning task, using sTeX, a variant of LaTeX with semantic annotations (Kohlhase, 2008). Our contributions are threefold:

1. We discuss the details of disambiguating symbolic expressions in informal STEM documents as a neural machine translation task,
2. we present a new dataset specifically for this task, based on the existing SMGLoM library of sTeX macros (see Subsection 2.2), and
3. we present a methodology (using transformer language models) that allows us to achieve positive results on our dataset.
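To make the translation framing concrete, the following minimal Python sketch shows what a training pair for such a task might look like. This is an illustration under assumptions, not the paper's pipeline: the semantic macro \natplus is a hypothetical placeholder for a semantic macro fixing "+" to mean addition on the natural numbers, and char_tokenize is an illustrative helper mirroring the character-based baselines mentioned below.

```python
# Sketch: disambiguation viewed as sequence-to-sequence translation,
# mapping plain LaTeX to semantically annotated (sTeX-style) LaTeX.
# \natplus is a hypothetical placeholder, not necessarily an actual SMGLoM macro.

src = r"\forall a, b: a + b = b + a"                     # ambiguous surface syntax
tgt = r"\forall a, b: \natplus{a}{b} = \natplus{b}{a}"   # disambiguated form

def char_tokenize(s: str) -> list[str]:
    """Character-level tokenization, as used by simple seq2seq baselines."""
    return list(s)

# A single (input, output) training pair for a character-based translation model:
pair = (char_tokenize(src), char_tokenize(tgt))
```

A real pipeline would additionally include the surrounding document context in the input, since (as discussed in Section 2) the correct reading of "+" depends on it.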
We previously evaluated several baseline NMT models (such as Luong et al. (2017), Vaswani et al. (2017), and a plain character-based sequence-to-sequence model), all of which failed to yield meaningful results because our dataset is considerably smaller than traditional NMT models require.

2. PRELIMINARIES

By disambiguating, we mean the task of transforming a sequence of symbols (representing a mathematical formula) into an abstract syntax tree and associating each leaf in the tree with a unique identifier specifying the precise semantics of the corresponding symbol. While this might superficially seem an easy task, closer consideration shows that even obvious-seeming statements such as "a + b" can in fact correspond to a multitude of possible disambiguations: a and b can be variables or previously defined constants, whereas + can represent e.g. addition on any of several number spaces, generic ring or vector space operations, or string concatenation. To disambiguate expressions adequately in general, it is therefore necessary to take into account the context in which the expression occurs.
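The target representation described above can be sketched as a small Python data structure: an AST whose leaves carry identifiers fixing their semantics. This is a minimal illustration, not the paper's actual implementation, and the URI strings are invented placeholders (SMGLoM is real, but this URI scheme is not taken from it).

```python
from dataclasses import dataclass
from typing import Union, Iterator

@dataclass
class Sym:
    name: str   # surface symbol, e.g. "+"
    uri: str    # identifier fixing its semantics (URIs here are illustrative)

@dataclass
class Var:
    name: str   # a variable such as "a", left undisambiguated

@dataclass
class App:
    head: Sym
    args: list

# Two of the many possible readings of the surface expression "a + b":
nat_plus = App(Sym("+", "smglom:arithmetics?NaturalPlus"), [Var("a"), Var("b")])
concat   = App(Sym("+", "smglom:strings?Concatenation"),   [Var("a"), Var("b")])

def leaves(t: Union[App, Sym, Var]) -> Iterator[Union[Sym, Var]]:
    """Collect the leaves of an AST, i.e. the symbols/variables to disambiguate."""
    if isinstance(t, App):
        yield t.head
        for a in t.args:
            yield from leaves(a)
    else:
        yield t

print([type(l).__name__ for l in leaves(nat_plus)])  # ['Sym', 'Var', 'Var']
```

Both readings share the same surface string and the same tree shape; only the identifier on the head symbol differs, which is exactly the information a disambiguation system must supply from context.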



All code and data relevant to this paper is available at https://gl.kwarc.info/dmueller/fifom.




