DISAMBIGUATING SYMBOLIC EXPRESSIONS IN INFORMAL DOCUMENTS

Abstract

We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files (that is, determining their precise semantics and abstract syntax tree) as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. We evaluate our model using a plurality of dedicated techniques, taking the syntax and semantics of symbolic expressions into account.

1. INTRODUCTION

Despite huge advancements in machine learning, the task of understanding informal reasoning is still beyond current methods. In fact, it has become commonplace for humans to annotate informal documents containing reasoning in many domains, e.g. law (Libal & Steen, 2020). Reasoning is most visible in mathematical documents and software specification, and as such, in the last decades the formalization of mathematical knowledge and the verification of formal proofs have become increasingly popular. By now, dozens of interactive and automated theorem prover systems are available, each providing libraries with up to hundreds of thousands of formalizations of mathematical definitions, theorems, and their proofs written by human mathematicians (Harrison et al., 2014). While formal methods are still primarily used by computer scientists (e.g. to verify software and hardware, as well as in program synthesis), by now they have also drawn the interest of an increasing number of research mathematicians, primarily thanks to famous problems such as Kepler's conjecture (Hales et al., 2017) or the classification theorem for finite simple groups (Solomon, 1995), which have successfully been verified using theorem prover systems. However, while some mathematicians have begun actively adopting formal methods for their work, there is a prohibitively large discrepancy between the way new mathematical results are developed, presented, and published in mathematical practice, and the way they are formalized and implemented in formal systems (Kaliszyk & Rabe, 2020): Most theorem proving systems implement a fixed logical foundation (such as variants of set theory or various kinds of type theories), a surface syntax in which a user declares new definitions and statements in terms of the underlying foundations, and either a tactic language or a language for expressing proof terms (usually on the basis of the Curry-Howard correspondence in a typed λ-calculus) that allows for declaring proofs.
Consequently, the process of formalizing new content in a formal system resembles programming much more than it does developing informal proofs. This discrepancy results in severe challenges for traditional mathematicians: Formal systems are difficult to learn and use, even if one is well acquainted with the (informal) mathematics involved. They require learning dedicated formal languages resembling programming languages and declaring content on a level of detail that is prohibitive for beginners even for "obvious" conclusions, and their libraries are difficult to grasp without already being familiar with the system's language, conventions and functionalities. Due to the required level of detail, knowledge of the existing libraries is crucial when formalizing new content. Furthermore, many "intuitively valid" arguments cannot be easily expressed in terms of a logical foundation in the first place, and knowing how to deal with those requires familiarity with the logical foundation involved and lots of practice. Consequently, the utility of formalizing mathematical results can be too easily (and too often is) dismissed in light of the additional time and work required for non-experts. This is despite the fact that many services available for formal mathematics are already enabled by semi-formal (or flexiformal) representations, such as semantic annotations in natural language texts, or formal representations containing opaque informal expressions (see e.g. Kohlhase (2013); Lange (2011a); Iancu (2017); Kohlhase et al. (2017a); Corneli & Schubotz (2017); Dehaye et al. (2016)). Therefore, we need to invest in methods for bridging the gap between informal mathematical practice and (semi-)formal mathematics. One way to do so is to investigate autoformalization, the task of (semi-automatically) converting existing informal mathematical presentations to (increasingly) formal representations.
Notably, these issues extend beyond pure mathematics to other STEM (science, technology, engineering and math) fields, where the formal verification (or lack thereof) of results can have direct real-world implications. Examples include an infamous and costly error in the floating-point unit of Intel processors (Harrison, 2003) and several human failures to adequately convert between SI and imperial units, most famously in NASA's Mars orbiter (Grossman). In fact, the former has already established formal verification as a vital tool in hardware design (Harrison, 2003). Two observations motivate the research presented here:
1. The vast majority of STEM researchers can be assumed to be comfortable with using LaTeX; any integration of formal methods in a LaTeX development environment (e.g. via new packages or IDE integration) would consequently lower the entry barrier significantly.
2. The task of going from purely informal mathematical texts to fully formal representations of the contained knowledge is best done via a separation of concerns, by focusing on individual subtasks (such as disambiguating symbolic expressions, parsing natural language, and translating it to a formal foundation) using dedicated tools for each.
In this paper, we discuss specifically the task of disambiguating symbolic expressions (i.e. associating all symbols in an expression with their precise semantics) in LaTeX documents as a machine learning task, using sTeX, semantically annotated LaTeX (Kohlhase, 2008). The contributions are threefold:
1. We discuss the details of disambiguating symbolic expressions in informal STEM documents as a neural machine translation task.
2. We present a new dataset specifically for this task, based on the existing SMGloM library of sTeX macros (see Subsection 2.2).
3. We present a methodology (using transformer language models) that allows us to achieve positive results on our dataset.
We previously evaluated several baseline NMT models (such as Luong et al. (2017); Vaswani et al. (2017) and a plain character-based sequence-to-sequence model), which all failed to yield meaningful results due to our dataset being considerably smaller than is required for traditional NMT models.

2. PRELIMINARIES

By disambiguating, we mean the task of transforming a sequence of symbols (representing a mathematical formula) into an abstract syntax tree and associating each leaf in the tree with a unique identifier specifying the precise semantics of the corresponding symbol. While this might superficially seem an easy task, closer consideration shows that even seemingly obvious statements such as "a + b" can in fact correspond to a multitude of possible disambiguations: a and b can be variables or previously defined constants, whereas + can represent e.g. addition on multiple different number spaces, generic ring or vector space operations, or string concatenation. In order to adequately disambiguate expressions generically, it is therefore necessary to take the context in which the expression occurs into account. In this paper, we consider informal documents in LaTeX specifically, which we will disambiguate with the sTeX package, using semantic identifiers provided by the SMGloM library. This eventually enables various formal knowledge management services (such as type/proof checking) provided by the MMT system.

2.1. STEX

Kohlhase proposed sTeX (Kohlhase, 2008), a package for annotating LaTeX documents with structural and formal semantics, which is today used by multiple groups formalizing mathematics in various systems. In particular, sTeX is based on OMDoc (Kohlhase, 2006), an extension of OpenMath (Buswell et al., 2004) which is foundation-agnostic in the sense that it does not favor a specific foundation (such as type or set theories) over any other. This approach is consequently best suited for semantifying informal documents, where foundations are often unspecified, left implicit or switched fluently. For example, category-theoretic and set-theoretic formulations are often used interchangeably in algebraic settings, whereas type theories are generally favored for computational aspects and formal systems. Figure 1 shows example sTeX macros and their usage in various stages. Relevant for this paper is primarily the \symdef command, which introduces a new mathematical concept (e.g. \nattimes in Figure 1). It takes as arguments a macro name (e.g. nattimes), a symbolic notation (last argument) and optionally an OMDoc name (e.g. multiplication), arity (e.g. [1], which may be flexary) and notational precedence (e.g. p=600, for automatic bracketing). It generates a unique identifier for the concept being declared (based on the provided OMDoc name), and a new LaTeX macro (e.g. \nattimes) for referring to the symbol. Alternative notational variants for symbols can be introduced via \symvariant, which are used as options to the macro (e.g. \nattimes[cdot]). In addition to being valid LaTeX, compilable via pdflatex, sTeX documents can be transformed to OMDoc using the LaTeXML software (Ginev et al., 2011), yielding a formally disambiguated representation of the document and in particular the symbolic expressions therein on the basis of the macros provided by \symdefs. LaTeXML also heuristically attempts to disambiguate non-sTeX symbols, e.g. by considering "=" and "+" as infix notations for generic equality and addition operators, respectively.

2.2. SMGLOM

The SMGloM (semantic multilingual glossary of mathematics; Kohlhase, 2014) is a library of hundreds of sTeX modules containing mathematical concepts and definitions. It is separated into signature modules (using the modsig environment, see Figure 1) containing only symbol declarations, and natural language modules (using the mhmodnl environment, shown here for English) that serve as dictionary entries for these, in which the semantics of the symbols are described in a semi-formal manner. The second row of Figure 1 shows an SMGloM entry.

2.3. MMT

sTeX itself is integrated, and shares an underlying OMDoc ontology, with the MMT system (Rabe & Kohlhase, 2013; Horozal et al., 2012; Rabe, 2017), a foundation-independent meta-framework and API for knowledge management services. This integration makes the generic services provided by MMT (e.g. type checking, library management/browsing, translation) available to informal mathematical texts. Using alignments (Müller, 2019; Müller et al., 2017), OMDoc expressions can be translated between different libraries, languages and foundations. This allows for e.g. translating (originally) sTeX content to a typed setting in order to e.g. check expressions and run type inference. Additionally, several theorem prover libraries have been translated to OMDoc and integrated in the MMT system, e.g. Kohlhase et al. (2017b); Müller et al. (2019) (for a detailed overview, see Müller (2019) and Kohlhase & Rabe (2020)). Extending these integrations to also support exporting from MMT (in conjunction with natural language processing) could enable verifying informal mathematics imported via sTeX using external state-of-the-art theorem prover systems.

<OMA>
  <OMS cd="smglom:mv?equal" name="equal"/>
  <OMA>
    <OMS cd="smglom:arithmetics?natarith" name="multiplication"/>
    <OMV name="x"/>
    <OMI>0</OMI>
  </OMA>
  <OMI>0</OMI>
</OMA>
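To make the tree structure of the OMDoc fragment above easier to see, the following minimal Python sketch parses it with the standard library and prints it as a prefix term. The helper names `render` and `symbols` are ours and purely illustrative; they are not part of MMT, LaTeXML or sTeX, and only the four element types occurring in the fragment are handled.

```python
import xml.etree.ElementTree as ET

# The OMDoc/OpenMath fragment from the text, copied verbatim.
OMDOC_FRAGMENT = """
<OMA>
  <OMS cd="smglom:mv?equal" name="equal"/>
  <OMA>
    <OMS cd="smglom:arithmetics?natarith" name="multiplication"/>
    <OMV name="x"/>
    <OMI>0</OMI>
  </OMA>
  <OMI>0</OMI>
</OMA>
"""


def render(node):
    """Render an element as a prefix term (illustrative helper)."""
    if node.tag in ("OMS", "OMV"):   # symbol reference or variable
        return node.get("name")
    if node.tag == "OMI":            # integer literal
        return node.text.strip()
    if node.tag == "OMA":            # application: head symbol, then arguments
        head, *args = list(node)
        return f'{render(head)}({", ".join(render(a) for a in args)})'
    raise ValueError(f"unsupported element: {node.tag}")


def symbols(node):
    """List (content dictionary, name) pairs of all disambiguated symbols."""
    return [(s.get("cd"), s.get("name")) for s in node.iter("OMS")]


tree = ET.fromstring(OMDOC_FRAGMENT.strip())
print(render(tree))
for cd, name in symbols(tree):
    print(cd, name)
```

Read this way, the fragment is the statement x · 0 = 0, with both the equality and the multiplication tied to specific SMGloM modules through their `cd` attributes.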

3. STATE OF THE ART

Various papers over the last years have, explicitly or implicitly, attempted to extract formal information from informal documents using machine learning. These fall into two categories: Firstly, there are projects that attempt to fully formalize informal mathematical documents using machine learning techniques, using the surface language of some theorem prover system directly as a target. In Kaliszyk et al. (2017a; 2015; 2014), the Flyspeck project (Hales et al., 2017), the formalization of Kepler's theorem, was used as a basis for a parallel dataset in order to translate from informal mathematics to HOL Light (Harrison, 1996) syntax. Kaliszyk et al. (2017b); Wang et al. (2018; 2020) target the Mizar language (Mizar) instead, using the Journal of Formalized Mathematics (JFM) as data, an informal representation of the formal Mizar Mathematical Library (Bancerek et al., 2018). While these projects achieved impressive results given the ambitious nature of the task, their success rate is naturally limited by the involved models having to solve several tasks at once (see the second observation in Section 1), including ours. Additionally, by going to a fully formal language (and logical foundation) immediately, the result does not preserve the narrative presentation of the input document, effectively losing (for us) valuable information in the process. Consequently, our task and the results obtained on it are not directly comparable to these projects.
Secondly, various projects have aimed to solve informally presented mathematical problems of various kinds. These include Arai et al. (2014); Matsuzaki et al. (2014; 2017; 2018) on pre-university math problems, Saxton et al. (2019) and Lample & Charton (2019) on high-school level equations, Gan & Yu (2017) and Seo et al. (2015) on geometric problems, and Huang et al. (2018) and Wang et al. (2017) on solving typical high-school word problems. While this naturally entails disambiguating symbolic expressions, all these projects reduce their domain of applicability to specific areas where all occurring formal symbols are syntactically unambiguous (primarily common arithmetic operations, functions, and relations on real numbers), such that disambiguation reduces to simple parsing of a fixed, small set of a priori known symbols.

4. TASK DEFINITION

Definition 4.1. (Disambiguation Task) Let L be a set of LaTeX fragments (i.e. strings), which we assume are syntactically valid LaTeX in some suitable document context. A symbolic expression is (for our purposes, simplified) any substring s of some S ∈ L such that s is interpreted by the TeX engine in math mode, e.g. if it is delimited by $, $$ or \[ and \] respectively. For the purposes of our task, we call S ∈ L fully disambiguated if every symbolic expression occurring in S only consists of:
1. variable names (e.g. n or \mathcal{G}, provided they do not represent specific, definite mathematical objects),
2. sTeX macros introduced via a \symdef declaration in the SMGloM, or
3. non-semantic commands or characters, such as additional spaces/tabs/linebreaks, purely aesthetic spacing or kerning commands, unnecessary parentheses or clarifying comments (e.g. in under- or overbraces).
Let L_sTeX ⊂ L be the subset of fully disambiguated LaTeX fragments. Conversely, let L_LaTeX ⊂ L be the set of LaTeX fragments that do not contain any sTeX macros. Clearly, for any S ∈ L, there is some LaTeX(S) ⊂ L_LaTeX such that S and any S' ∈ LaTeX(S) represent the same symbolic presentation, i.e. they generate the same output on pdflatex. Conversely, we assume that for any S ∈ L there is a set sTeX(S) ⊂ L_sTeX such that
1. LaTeX(S) = LaTeX(S') for all S' ∈ sTeX(S) (i.e. they have the same symbolic presentation), and
2. all S' ∈ sTeX(S) capture the intended semantics of S, i.e. the author of S, were they to know the SMGloM library sufficiently well, would agree that S' is a correctly fully disambiguated variant of S.
Our goal is to learn a function f : L → L such that for any S ∈ L we have f(S) ∈ sTeX(S).

As an example, consider the sentence from Figure 1: Multiplication $\cdot$ computes the product $a\cdot b$ (also written as $ab$ or $a\times b$) of natural numbers $a$ and $b$. The last two symbolic expressions ($a$ and $b$) only consist of variable names, and are thus considered fully disambiguated already. The first one ($\cdot$) refers to the multiplication operator on natural numbers, which in sTeX is represented as \nattimesOp; the remaining symbolic expressions are all multiplications on natural numbers applied to the variables a and b with different notations, represented in sTeX via \nattimes with various options. We expect the target function f on this input sentence to output:
Multiplication $\nattimesOp$ computes the product $\nattimes[cdot]{a,b}$ (also written as $\nattimes{a,b}$ or $\nattimes[x]{a,b}$) of natural numbers $a$ and $b$.
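The notion of a symbolic expression in Definition 4.1 can be approximated with a regular expression. The sketch below is a simplification under our own assumptions: it only recognizes the three delimiter styles named in the definition, it ignores escaped dollar signs and math environments such as `equation`, and the name `symbolic_expressions` is hypothetical.

```python
import re

# Math-mode spans delimited by $$...$$, $...$ or \[...\].
# Simplification: escaped dollars (\$) and math environments are not handled.
MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"      # display math $$...$$
    r"|\$(.+?)\$"         # inline math $...$
    r"|\\\[(.+?)\\\]",    # display math \[...\]
    re.DOTALL,
)


def symbolic_expressions(fragment):
    """Return the contents of all math-mode spans in a LaTeX fragment."""
    found = []
    for match in MATH_PATTERN.finditer(fragment):
        # Exactly one of the three groups is set for each match.
        body = next(g for g in match.groups() if g is not None)
        found.append(body.strip())
    return found


SENTENCE = (
    r"Multiplication $\cdot$ computes the product $a\cdot b$ "
    r"(also written as $ab$ or $a\times b$) of natural numbers $a$ and $b$."
)
print(symbolic_expressions(SENTENCE))
```

On the example sentence this yields six symbolic expressions, of which only the last two (`a` and `b`) are already fully disambiguated in the sense of the definition.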

5. DATASETS

We have two datasets of sTeX content:
1. The SMGloM, which introduces precisely those macros that we want to be learned by a model. Unfortunately, it provides relatively few symbols and hence can only cover a small part of informal documents even in theory. Additionally, apart from some rudimentary concepts such as logical connectives or basic arithmetic functions, the SMGloM library references the majority of symbols only once (in the corresponding dictionary entry). This is unlike most other formal systems, where all symbols need to be typed or defined formally when being declared, which naturally leads to a significant number of references to previously declared symbols.
2. The MiKoMH repository of lecture notes by Michael Kohlhase (the author of sTeX), which is heavily biased towards subjects in computer science, covers only a small part of SMGloM entries, and often introduces local \symdefs.
Notably, while the translation from source to target language is difficult, the reverse translation (from sTeX to plain LaTeX) is easy: Since sTeX macros internally expand (ultimately) to the plain notational representation as basic LaTeX, translating from the target to the source language amounts to merely expanding sTeX macros. This allows for easily generating a parallel dataset from a set of documents in the target language. To obtain such a parallel corpus for supervised learning, we take the individual LaTeX files in those repositories and do the following:
1. We separate the documents into small fragments of (on average) 500 characters in length, which we consider to be the sentences in L_sTeX. Symbolic expressions occur preferably at the end of a sentence, based on the assumption that preceding text provides a more meaningful context for disambiguation. Sentences that do not contain symbolic expressions are ignored.
2. In each sentence S = S_sTeX ∈ L_sTeX, we apply a standardization function which e.g. removes non-semantic macros and ensures that macro arguments are always braced, in order to minimize author bias.
3. We extract all symbolic expressions (m_{sTeX,i})_{i ≤ n_S} in S and expand all sTeX macros in them, resulting in (m_{LaTeX,i})_{i ≤ n_S} (where n_S is the number of symbolic expressions in S). Analogously, we expand all sTeX macros in S itself, yielding S_LaTeX ∈ L_LaTeX.
Each entry in our dataset then consists of a 4-tuple (S_LaTeX, S_sTeX, (m_{LaTeX,i})_{i ≤ n_S}, (m_{sTeX,i})_{i ≤ n_S}). In total, we obtain 911 entries from SMGloM and 9200 entries from MiKoMH.

Synthesizing Training Data. In order to augment our datasets for supervised learning, we opted to exploit the MMT integration to synthesize additional training data. For that, we aligned SMGloM symbols with declarations in a strongly typed MMT archive, namely the Math-in-the-Middle (MitM) library (Müller, 2019). This allows us to randomly generate well-typed (and hence syntactically well-formed) terms in a typed setting, translate these along alignments to sTeX expressions and subsequently generate surrounding verbalizations. The generating algorithm takes as input a set of symbols Sym (e.g. all MitM symbols for which an alignment to SMGloM exists) and a starting symbol s ∈ Sym (e.g. nattimes; binary multiplication on natural numbers). It returns a random well-typed formal expression t which is guaranteed to contain s. Afterwards, it is verbalized as an sTeX sentence using natural language fragments (a detailed description of the algorithm is given in Appendix A). The synthesized sTeX sentences are then treated as above to augment our parallel training corpus.
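The construction of parallel entries described earlier in this section (extract the symbolic expressions, expand the sTeX macros, keep both versions) can be sketched in a few lines of Python. This is only an illustration under strong assumptions: in the actual pipeline the expansion is performed by TeX from the \symdef and \symvariant declarations, whereas the tables `NOTATIONS` and `CONSTANTS` below are a hand-written, hypothetical stand-in that covers just the macros of the running example and no nested macros.

```python
import re

# Hypothetical notation table for illustration only: (macro, variant) -> infix.
NOTATIONS = {
    ("nattimes", "cdot"): r"\cdot ",
    ("nattimes", "x"): r"\times ",
    ("nattimes", None): "",
}
CONSTANTS = {r"\nattimesOp": r"\cdot"}

MACRO = re.compile(r"\\(nattimes)(?:\[(\w+)\])?\{([^{}]*)\}")
MATH = re.compile(r"\$(.+?)\$")


def expand(expr):
    """Expand the toy sTeX macros in one symbolic expression (no nesting)."""
    for macro, latex in CONSTANTS.items():
        expr = expr.replace(macro, latex)

    def infix(match):
        joiner = NOTATIONS[(match.group(1), match.group(2))]
        return joiner.join(arg.strip() for arg in match.group(3).split(","))

    return MACRO.sub(infix, expr)


def make_entry(s_stex):
    """Build the 4-tuple (S_LaTeX, S_sTeX, (m_LaTeX,i), (m_sTeX,i))."""
    m_stex = MATH.findall(s_stex)
    m_latex = [expand(m) for m in m_stex]
    s_latex = MATH.sub(lambda m: "$" + expand(m.group(1)) + "$", s_stex)
    return s_latex, s_stex, m_latex, m_stex


S_STEX = (
    r"Multiplication $\nattimesOp$ computes the product $\nattimes[cdot]{a,b}$ "
    r"(also written as $\nattimes{a,b}$ or $\nattimes[x]{a,b}$) "
    r"of natural numbers $a$ and $b$."
)
entry = make_entry(S_STEX)
print(entry[0])
```

The sketch also makes the asymmetry of the task visible: going from sTeX to LaTeX is a deterministic rewrite, while the reverse direction has to choose between macros that expand to the same notation.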
As an evaluation dataset, we developed sTeX documents based on selected fragments of introductory sections from mathematics lecture notes, primarily containing basics such as set operations, number spaces, examples for proofs by induction, basic combinatorics, and definitions of common algebraic structures, with 161 symbolic expressions in total. Importantly, these documents were written by hand, with a focus on featuring multiple symbols with the same symbolic representation, primarily the usual arithmetic operations on different number spaces. Of the ≈ 100 SMGloM symbols used therein, 92 were aligned with corresponding symbols in the MitM library and used as input symbols for synthesizing sentences, with 250 sentences per starting symbol (so as not to drown out the non-synthesized sentences), yielding 92 · 250 = 23,000 additional sentences. Unlike the training datasets, the evaluation document was translated to plain LaTeX manually using the PDF as a reference, in order to avoid possible spurious patterns in automatically expanded sTeX.

6. STEX-ANNOTATING WITH MACHINE LEARNING AS AN NMT TASK

In the course of our experiments, we considered our disambiguation task as a neural machine translation (NMT) problem, the models for which have proven to be quite effective even beyond natural language translations (Clark et al., 2020). In fact, the autoformalization projects mentioned in Section 3, which are spiritually closest to our task, all used NMT models with positive results. There are however several aspects that distinguish a LaTeX-to-sTeX translation from similar translation tasks and that significantly affect the applicability of existing tools and hence our methodology. First, unlike for the most popular formal systems, there is no large library of formalizations for the translation target. This leaves us with only a small dataset that (for the reasons outlined in Section 5) does not represent well the general distribution we would like to learn. Second, translation is only relevant for specific fragments of an input text, namely the symbolic expressions; for the surrounding natural language texts, translation should be the identity. Nevertheless, surrounding text usually contains critical information for disambiguation; e.g. without the surrounding context, it is impossible to disambiguate an expression a + b, since the symbol "+" could refer to any of dozens of addition operations. Finally, depending on perspective, the domain language is a proper subset of the target language; or rather (since we want to avoid ambiguous expressions in sTeX) domain and target language share both a basic grammar and a large amount of vocabulary (namely L_LaTeX ∩ L_sTeX), which e.g. subsumes natural English. For the domain language, large datasets are easily obtainable. Our task could also be considered as a text style transfer task; e.g. Yang et al. (2019) use pre-trained language models for text style transfer, roughly similar to (but more sophisticated than) our approach. While the datasets used therein are still considerably larger than ours, this might be a promising avenue for future improvements over our model.

7. METHODOLOGY

Notably, sTeX macros reflect the syntax tree of an expression, so that on symbolic expressions alone, the representation of the target sequences is naturally analogous to those chosen in string-to-tree translations (Aharoni & Goldberg, 2017). Plain LaTeX however is not naturally amenable to a tree-structured representation, making tree-to-tree approaches (Chen et al., 2018) not easily applicable to our dataset. Initial experiments using standard, dedicated NMT models with full sentences as input/output quickly proved to be ineffective due to the size of the training corpus, which was too small to cause these models to even generate syntactically correct LaTeX (e.g. knowing to balance pairs of brackets) before overfitting on the training data. This makes it difficult to compare our approach to an informative baseline model. Transformer language models (e.g. [...] (2020)) allow us to leverage huge available corpora of plain LaTeX documents to train a model to "understand" both basic LaTeX syntax and mathematical terminology. Using those, we consequently do not need to rely on our small dataset for this base-level understanding. We can then approach learning sTeX annotations as a downstream task on a pre-trained transformer model. Consequently, we pre-trained a GPT2 (Radford et al., 2019) model on a large portion of available LaTeX sources of scientific papers from the preprint repository arxiv.org (6,673,950 entries of length 1,024 tokens). The model was trained from scratch in order to use a dedicated tokenizer trained on LaTeX directly (byte-level tokenizer; vocabulary size 32,000) rather than natural language alone. In order to leverage the pretrained model for both source and target language (footnote 5), we subsequently opted to fine-tune the GPT2 model on inputs of the form S_LaTeX <s> m_LaTeX <s> m_sTeX <s>, where <s> is a single-token separator (footnote 6). For example, for Figure 1 the training data contains fragments (normalized) such as:
Multiplication $\cdot$ computes the product $a\cdot b$ (also written as $ab$ or $a\times b$) of natural numbers $a$ and $b$. <s> $a\cdot b$ <s> $\nattimes[cdot]{a,b}$ <s>
We then use text generation on inputs of the form S_LaTeX <s> m_LaTeX <s> for translating and stop generating after encountering <s>. By using one entry per symbolic expression, we obtain a dataset of 121,368 examples. The GPT2 model was fine-tuned on these for five epochs, resulting in an average training loss of 0.04 and yielding promising results on the evaluation set (see below). This approach has the following advantages:
1. It allows for using large datasets of generic LaTeX documents to learn basic syntactic rules and semantics of mathematical expressions beyond our small sTeX datasets.
2. We conjecture that this approach makes the model less sensitive to spurious patterns in the synthesized part of our dataset.
3. Adding new symbols to the SMGloM and aligning them to (new or existing) symbols in the MitM library allows for immediately synthesizing training data, obviating the need to first obtain large amounts of data using the new symbol before the model can learn to use it.
4. The pretrained GPT2 model itself can be trained on additional downstream tasks, e.g. introducing macros for referencing mathematical concepts in natural language fragments.
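The input format used for fine-tuning and generation can be illustrated with plain string handling. The following sketch is not the training code; it only shows, under our reading of the format above, how one dataset entry yields one fine-tuning string per symbolic expression and how a generated continuation would be cut at the first separator. The function names are hypothetical, and the model call itself is deliberately left out.

```python
SEP = "<s>"  # the single-token separator from the text


def training_examples(s_latex, m_latex, m_stex):
    """One string 'S_LaTeX <s> m_LaTeX <s> m_sTeX <s>' per symbolic expression."""
    return [
        f"{s_latex} {SEP} ${ml}$ {SEP} ${ms}$ {SEP}"
        for ml, ms in zip(m_latex, m_stex)
    ]


def prompt(s_latex, m_latex_i):
    """Generation input 'S_LaTeX <s> m_LaTeX <s>' for one expression."""
    return f"{s_latex} {SEP} ${m_latex_i}$ {SEP}"


def extract_translation(generated, prompt_text):
    """Keep the continuation after the prompt, up to the first separator."""
    continuation = generated[len(prompt_text):]
    return continuation.split(SEP, 1)[0].strip()


S_LATEX = (
    r"Multiplication $\cdot$ computes the product $a\cdot b$ "
    r"(also written as $ab$ or $a\times b$) of natural numbers $a$ and $b$."
)
examples = training_examples(
    S_LATEX,
    [r"a\cdot b", "ab"],
    [r"\nattimes[cdot]{a,b}", r"\nattimes{a,b}"],
)
print(examples[0])

# Stand-in for a model output: the prompt followed by a generated continuation.
p = prompt(S_LATEX, r"a\cdot b")
fake_output = p + r" $\nattimes[cdot]{a,b}$ <s> $ab$ <s>"
print(extract_translation(fake_output, p))
```

Because every symbolic expression gets its own example with the full sentence as context, a corpus of sentences grows into a considerably larger number of training strings, which matches the step from sentence counts to 121,368 examples reported above.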

8. EVALUATION AND RESULTS

The traditional evaluation metrics (loss during evaluation, perplexity, BLEU) are somewhat difficult and/or meaningless to apply in our situation, since
1. the returned tokens and provided label tokens might differ in semantically irrelevant ways (e.g. $a+b$ vs. $a + b$), and
2. loss/perplexity would be evaluated during a forward pass in a next-token prediction task on a token-by-token basis, which would retroactively "correct" errors in prediction that would otherwise yield completely wrong results.
Consequently, we opted for a plurality of evaluation strategies. Let S_F be the sentence returned by our model on an input S_LaTeX with the correct label S_sTeX. Then on our evaluation set we get
1. S_F ∈ L for 96.9% of inputs,
2. S_LaTeX ∈ LaTeX(S_F) for 64.0% of inputs,
3. S_F ∈ L_sTeX for 60.2% of inputs, and
4. S_F = S_sTeX for 47.2% of inputs.
In comparison, using traditional NMT models such as Luong et al. (2017); Vaswani et al. (2017) we effectively obtained 0% success rates for all of the above. Additional evaluation techniques exploiting the MMT integration are described in Appendix B.

Figure 2: Example Inputs and Outputs from our Evaluation Set

  Example 1
    S_LaTeX: \mathbb{N}=\{0,1,2,3,\ldots\}
    S_sTeX:  \eq{\NaturalNumbers,\setdots{0,1,2,3}}
    S_F:     \NaturalNumbers=\set{0,1,2,3}
  Example 2
    S_LaTeX: (A \subseteq B)\Leftrightarrow(\forall x\in A. x\in B)
    S_sTeX:  \biimpl{\sseteq{A}{B}}{\foral{\inset{x}{A}}{\inset{x}{B}}}
    S_F:     \biimpl{\sseteq{A}{B}}{\foral{x}{A}\inset{x}{B}}}
  Example 3
    S_LaTeX: \mathcal{P}(A):=\{x|x\subseteq A\}
    S_sTeX:  \defeq{\powerset{A}}{\setst{x}{\sseteq{x}{A}}}
    S_F:     \defeq{\powerset{A}}{\bsetst{x}{x}{\sset{x}{x} A}}
  Example 4
    S_LaTeX: 1+2+3+4+5=(5\cdot6)/2=15
    S_sTeX:  \eq{\natplus{1,2,3,4,5},\natdiv[slash]{\nattimes[cdot]{5,6}}{2},15}
    S_F:     \natplus{1,2,3,4,5}=\natdiv[slash]{\natplus{\nattimes[cdot]{5,6},4,5}}{2}=15

Figure 2 shows a few examples where our model "failed" in interesting ways. As the first and fourth examples show, the model seems to consistently fail to replace "=" by the intended macro \eq, a failure that LaTeXML can recover from when converting to OMDoc, but that also regularly occurs in the training data. Similarly, \ldots often leads to wrong translations: The first example shows that the model simply dropped \ldots, using a generic set constructor macro \set rather than \setdots, the one specifically intended for sets ending in ellipses. In the second example, the model seems to introduce a nonsensical additional argument for the \foral macro. Notably, the expression ∀x ∈ A.P can also be achieved using the dedicated macro \foralS{x}{A}{P}. Seemingly, the model chose the macro \foral together with the arguments for the \foralS macro, yielding a wrong translation that generates a wrong PDF output, while being "semantically almost correct". In the third example, the model confuses the macro \setst (for set comprehension) with a more complex macro \bsetst (for set comprehension with a complex pattern on the left side). Additionally, it confuses \sseteq (for inclusive subsets x ⊆ A) with \sset (for generic subsets x ⊂ A), duplicating the first argument and moving the intended argument A outside the scope of the macro. Example four is interesting in that the model correctly identifies the arithmetic operations as those on the natural numbers, but spuriously inserts an additive term \natplus{...,4,5}; this is likely an artifact from the left-hand side of the equation. Interestingly, these kinds of artifacts occur more than once in our evaluation set.

Footnote 5: Initial experiments with the pretrained model as encoder component only showed improvements over randomly initialized encoder-decoder models, but ultimately still proved unsuitable due to the small dataset size.
Footnote 6: Inspired by http://jalammar.github.io/illustrated-gpt2/#part-3-beyond-language-modeling
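The first caveat in this section (outputs that differ from the label only in semantically irrelevant whitespace, such as $a+b$ vs. $a + b$) suggests comparing normalized strings rather than raw tokens. The sketch below shows one possible normalization and an exact-match rate built on it. It is an assumption on our part, not the evaluation code used for the reported numbers: it treats all whitespace as irrelevant except where it separates a control word from a following letter, which is wrong inside text-mode arguments such as \text{...}.

```python
import re

# Tokens: control words, control symbols, whitespace runs, single characters.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\s+|[^\s\\]")
CONTROL_WORD = re.compile(r"\\[A-Za-z]+")


def normalize(expr):
    """Remove whitespace unless it ends a control word before a letter."""
    tokens = TOKEN.findall(expr)
    out = []
    for i, tok in enumerate(tokens):
        if tok.isspace():
            prev = tokens[i - 1] if i > 0 else ""
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # '\cdot b' must not collapse to '\cdotb'; 'a + b' may become 'a+b'.
            if CONTROL_WORD.fullmatch(prev) and nxt[:1].isalpha():
                out.append(" ")
            continue
        out.append(tok)
    return "".join(out)


def exact_match_rate(predictions, labels):
    """Fraction of predictions equal to their label after normalization."""
    hits = sum(normalize(p) == normalize(l) for p, l in zip(predictions, labels))
    return hits / len(labels)


preds = ["a + b", r"\natplus{a, b}", r"\NaturalNumbers=\set{0,1,2,3}"]
golds = ["a+b", r"\natplus{a,b}", r"\eq{\NaturalNumbers,\setdots{0,1,2,3}}"]
print(exact_match_rate(preds, golds))
```

A check of this kind corresponds only to the fourth criterion (S_F = S_sTeX); the second criterion would additionally require expanding the sTeX macros or comparing compiled output.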

9. CONCLUSION

We have proposed the task of disambiguating symbolic expressions in informal STEM documents and defined this task formally. This allows for annotating informal documents semantically, and further processing them using tools that support such annotated documents (e.g. MMT). We discussed the specificity of this task and what separates this task from other NMT problems. We developed a dataset for this task and presented an approach that yields promising results, especially in light of the size of the dataset. In particular, the presented approach points to the efficacy of using transformer models pretrained on generic LaTeX documents. In the future, we plan to combine the proposed symbolic disambiguation approach with an autoformalization framework. This way we aim to achieve better results for end-to-end formalization of informal mathematical documents. Furthermore, more promising results for the currently proposed task could be obtained by reintegrating the proposed models into an encoder-decoder NMT model.

A SYNTHESIZING TRAINING DATA

The generating algorithm takes as input a set of symbols Sym (e.g. all MitM symbols for which an alignment to SMGLoM exists) and a starting symbol s ∈ Sym (e.g. nattimes, the binary multiplication on natural numbers). The algorithm then proceeds as follows:

1. If s : T has a (simple or dependent) function type, we fill in the required arguments. For s = nattimes, our type is T = Nat → Nat → Nat, hence we need to find two arguments s_1, s_2 of type Nat. For each s_i of required type T_i we proceed as follows:
   (a) With probability p_var, we introduce a new variable v : T_i from a list of allowed variable names (which include variants such as a, a', a_0 etc.) and let s_i := v.
   (b) With probability p_fun, we pick a symbol f ∈ Sym with a function type with return type T_i (e.g. for T_i = Nat, we can pick natplus). In that case, we let s := f, recurse, and set s_i to the result.
   (c) With probability p_const = 1 - p_var - p_fun, we pick a constant symbol c ∈ Sym of type T_i (e.g. for T_i = Nat we can pick 0) and let s_i := c.
   In order to avoid stack overflows, we reduce p_fun in each iteration by a certain factor < 1. So as not to overuse certain symbols, we scale p_fun and p_const with the number of respectively suitable symbols available; if Sym contains no suitable function or constant symbols, we let p_fun = 0 (and/or p_const = 0, respectively).
2. If s : T does not have a function type (or all its parameters have been filled in step 1), then s is well-typed and we return s with probability 1 - p_up. With probability p_up, we instead pick a new symbol s_f ∈ Sym of some function type such that some i-th parameter type of s_f is T. In that case, we let s_i := s and s := s_f and recurse. Again, in order to avoid stack overflows, we reduce p_up by some factor with each iteration.

The algorithm also takes subtyping into account, e.g. whenever a term of type Real is required, terms of type Int or Nat are used with some probability.
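The two steps above can be illustrated with a minimal Python sketch. It is not the implementation used for the dataset: the toy symbol table SYM, the variable names, the default probabilities and the decay factor are all assumptions made for illustration, and both subtyping and the scaling of p_fun and p_const by the number of suitable symbols are left out. Terms are represented as nested (name, arguments) tuples rather than as MitM terms.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str
    arg_types: tuple  # parameter types; empty for constants
    ret_type: str


# Hypothetical toy stand-in for Sym (the paper uses aligned MitM symbols).
SYM = [
    Symbol("zero", (), "Nat"),
    Symbol("one", (), "Nat"),
    Symbol("natsucc", ("Nat",), "Nat"),
    Symbol("natplus", ("Nat", "Nat"), "Nat"),
    Symbol("nattimes", ("Nat", "Nat"), "Nat"),
]
BY_NAME = {s.name: s for s in SYM}
VAR_NAMES = ["a", "a'", "a_0", "n", "x"]  # assumed list of allowed names


def fresh_var(ty, rng, free_vars):
    # Step 1(a): pick an allowed name (possibly reusing one of the same type).
    names = [v for v in VAR_NAMES if free_vars.get(v, ty) == ty]
    name = rng.choice(names)
    free_vars[name] = ty
    return (name, ())


def gen_arg(ty, rng, p_var, p_fun, decay, free_vars):
    funs = [s for s in SYM if s.arg_types and s.ret_type == ty]
    consts = [s for s in SYM if not s.arg_types and s.ret_type == ty]
    if not funs:
        p_fun = 0.0  # no suitable function symbol available
    r = rng.random()
    if r < p_var:
        return fresh_var(ty, rng, free_vars)
    if r < p_var + p_fun:
        # Step 1(b): recurse with a reduced p_fun so that nesting dies out.
        return fill(rng.choice(funs), rng, p_var, p_fun * decay, decay, free_vars)
    if consts:
        return (rng.choice(consts).name, ())  # step 1(c)
    return fresh_var(ty, rng, free_vars)  # p_const = 0: fall back to a variable


def fill(sym, rng, p_var, p_fun, decay, free_vars):
    # Step 1: fill in every required argument of sym.
    args = tuple(gen_arg(t, rng, p_var, p_fun, decay, free_vars)
                 for t in sym.arg_types)
    return (sym.name, args)


def generate(start, rng, p_var=0.4, p_fun=0.3, p_up=0.5, decay=0.5):
    free_vars = {}
    term, ty = fill(start, rng, p_var, p_fun, decay, free_vars), start.ret_type
    # Step 2: with probability p_up, wrap the term as the i-th argument of s_f.
    while rng.random() < p_up:
        hosts = [(f, i) for f in SYM
                 for i, t in enumerate(f.arg_types) if t == ty]
        if not hosts:
            break
        f, i = rng.choice(hosts)
        args = tuple(term if j == i
                     else gen_arg(t, rng, p_var, p_fun, decay, free_vars)
                     for j, t in enumerate(f.arg_types))
        term, ty = (f.name, args), f.ret_type
        p_up *= decay
    return term, free_vars


def type_of(term, free_vars):
    # Sanity check: recompute the type of a generated term.
    name, args = term
    if name in free_vars:
        return free_vars[name]
    sym = BY_NAME[name]
    if [type_of(a, free_vars) for a in args] != list(sym.arg_types):
        raise TypeError(f"ill-typed application of {name}")
    return sym.ret_type
```

Calling generate(BY_NAME["nattimes"], random.Random(0)) returns a term tree together with the types of its free variables; passing a seeded random.Random makes a run reproducible. Multiplying p_fun and p_up by decay on each level plays the role of the reduction factors mentioned in steps 1 and 2.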
In order to obtain a sentence in the sense of Section 5 providing context for disambiguation, we first translate the generated term t along alignments to SMGLoM (using a random \symvariant), collect the set V of all free variables of t and verbalize their types. For that, we associate each type with a set of verbalizations from which we choose randomly to produce a sentence that introduces the variables before using them in the generated expression. Figure 3 shows a few example verbalizations for a variable x of type Nat and generated sentences for the input symbol s = realuminus, the negation on real numbers. The verbalizations are categorized as prefixed (e.g. "a natural number n") or suffixed (e.g. "n a natural number"), and as singular or plural. They are picked according to the number of variables of the same type and the surrounding sentence, which is also picked at random (e.g. "Assume we have ..." uses prefixed, whereas "Let ..." uses suffixed).
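The verbalization step can be sketched in Python as follows. This is only an illustration under stated assumptions: the VERBALIZATIONS table, the two sentence TEMPLATES and the helper names are hypothetical, and the actual sets of verbalizations and surrounding sentences are larger. The sketch groups free variables by type, chooses singular or plural by group size, and lets the randomly chosen sentence decide between prefixed and suffixed forms.

```python
import random
from collections import defaultdict

# Hypothetical table: type -> style -> list of (singular, plural) phrases.
VERBALIZATIONS = {
    "Nat": {
        "prefixed": [("a natural number {v}", "natural numbers {v}")],
        "suffixed": [("{v} a natural number", "{v} natural numbers")],
    },
    "Int": {
        "prefixed": [("an integer {v}", "integers {v}")],
        "suffixed": [("{v} an integer", "{v} integers")],
    },
}

# Hypothetical surrounding sentences, each tied to a verbalization style.
TEMPLATES = [
    ("Assume we have {intro}. It follows that {expr}.", "prefixed"),
    ("Let {intro}. Then consider {expr}.", "suffixed"),
]


def join(items):
    # "x", "x and y", "x, y and z"
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def verbalize(free_vars, expr, rng):
    """Build a sentence that introduces all free variables before expr."""
    template, style = rng.choice(TEMPLATES)
    by_type = defaultdict(list)
    for name, ty in free_vars.items():
        by_type[ty].append("$" + name + "$")
    parts = []
    for ty, names in by_type.items():
        singular, plural = rng.choice(VERBALIZATIONS[ty][style])
        phrase = singular if len(names) == 1 else plural
        parts.append(phrase.format(v=join(names)))
    return template.format(intro=join(parts), expr="$" + expr + "$")
```

For instance, verbalize({"a": "Nat", "b": "Nat", "k": "Int"}, r"\inttimes{a,b,k}", random.Random(0)) yields one sentence in which a and b share a plural phrase and k gets a singular one.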

B EVALUATION TACTICS

For every LaTeX input S_LaTeX, expected label S_sTeX and returned sentence S_R, we employ the following strategies, the results of which are summarized in Figure 4:

islatex: We parse S_R into an AST. Success implies that S_R is syntactically valid LaTeX. This might fail for "minor" reasons such as a missing closing bracket. It might yield false positives in cases where macros (not explicitly considered by our parser) occurring in S_R have a wrong number of arguments. All subsequent evaluation strategies require islatex to succeed.

stexcheck: We heuristically check whether S_R is in L_sTeX. Unlike islatex, this requires that all sTeX macros occurring in S_R have the right number of arguments. Success does not tell us that the input has been disambiguated correctly, but does imply that it has been disambiguated at all. False negatives can occur if S_R (and thus likely S_LaTeX as well)

eval_latex: All sTeX macros occurring in S_R are expanded and S_R is normalized as described in Section 5. The result is string-compared to S_LaTeX. Success thus implies that the notational presentation in the PDF output of S_LaTeX and S_R will coincide. False negatives can occur due to minor differences, e.g. in not strictly necessary brackets.

omdoc: S_R is translated to OMDOC using LaTeXML and imported to MMT. Success guarantees syntactic well-formedness of S_R. Since both the LaTeXML-OMDOC export and the subsequent MMT import are somewhat brittle, this can easily lead to false negatives.

translated: The import from omdoc is translated to the typed MitM library. This entails that all symbols used in S_R are aligned with MitM symbols and S_R is amenable to formal knowledge management services.

inferred: The translation to MitM obtained from translated is type checked by MMT by having its type inferred. Success guarantees that S_R is well-typed. Notably, if S_R is a mere variable (e.g. the expression $n$), it does not actually have an inferrable type, but succeeds trivially. This accounts for 60 of the entries in our evaluation set, i.e. 37%.

provided_stex: Both the expected label S_sTeX and S_R are normalized and string-compared. Success implies that S_R is definitely the correct translation. However, false negatives can easily occur due to non-semantic differences between S_sTeX and S_R, such as bracketing, nested applications in S_R (e.g. $\natplus{\natplus{a,b},c}$ vs. $\natplus{a,b,c}$), etc.

stex_as_omdoc: S_sTeX is translated to OMDOC via LaTeXML and directly compared to the OMDOC term obtained from omdoc. Like provided_stex, success implies that S_R is correct, but it is more fault-tolerant with respect to the precise syntax of S_R, while being less fault-tolerant due to the issues mentioned in omdoc.

The first three evaluations can always be applied; of the remaining ones, all but provided_stex require a working installation of LaTeXML and its sTeX plugin. The last two require a known correct translation.

Verbalizations (generated sTeX, followed by its PDF output):
$\inset{x}{\NaturalNumbers}$
    x ∈ ℕ
a positive integer $x$
    a positive integer x
an integer $\intmethan{x}{0}$
    an integer x ≥ 0
a natural number $x$
    a natural number x

Sentences (generated sTeX, followed by its PDF output):
Assume we have some $\inset{y'}{\NaturalNumbers}$ and arbitrary $\inset{\mathcal F}{\IntegerNumbers}$. It follows that $\realuminus{\realuminus{\inttimes[x]{\mathcal F,y',y'}}}$.
    Assume we have some y' ∈ ℕ and arbitrary F ∈ ℤ. It follows that --(F × y' × y').
Let $\natmorethan n{0}$. Then consider $\realuminus{\realuminus{\natsucc{\natsucc n}}}$.
    Let n > 0. Then consider --S(S(n)).
Whenever we have some positive natural number $\varepsilon$, any integer $\ell$ and a real number $\livar{\mathcal C}{2}$, then it follows that $\realtimes{\livar{\mathcal C}{2},\livar{\mathcal C}{2},\realplus{\realuminus{\ell},\natsucc{\varepsilon}}}$.
    Whenever we have some positive natural number ε, any integer ℓ and a real number C_2, then it follows that C_2 C_2 (-ℓ + S(ε)).
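To make the difference between islatex and stexcheck concrete, here is a rough Python sketch of both checks. It is not the parser used in our evaluation: the parse function handles only braces, macros and optional [...] arguments, and the ARITY table is a hypothetical excerpt whose argument counts are read off the expected labels in Figure 2.

```python
import re

# Hypothetical excerpt of an sTeX arity table (assumed, not the real list).
ARITY = {"biimpl": 2, "sseteq": 2, "foral": 2, "inset": 2}

# A control sequence, optionally followed by one [...] argument.
MACRO = re.compile(r"\\([A-Za-z]+|.)(\[[^\]]*\])?")


def parse(src, pos=0, in_group=False):
    """Parse src into a list of nodes; raise ValueError on brace errors."""
    nodes = []
    while pos < len(src):
        ch = src[pos]
        if ch == "}":
            if not in_group:
                raise ValueError("unmatched closing brace")
            return nodes, pos + 1
        if ch == "{":
            body, pos = parse(src, pos + 1, True)
            nodes.append(("group", body))
        elif ch == "\\":
            m = MACRO.match(src, pos)
            if m is None:
                raise ValueError("dangling backslash")
            pos, args = m.end(), []
            # Every directly following {...} group counts as an argument.
            while pos < len(src) and src[pos] == "{":
                body, pos = parse(src, pos + 1, True)
                args.append(body)
            nodes.append(("macro", m.group(1), args))
        else:
            nodes.append(("text", ch))
            pos += 1
    if in_group:
        raise ValueError("missing closing brace")
    return nodes, pos


def macros(nodes):
    # Yield (macro name, number of arguments) for every macro in the tree.
    for node in nodes:
        if node[0] == "macro":
            yield node[1], len(node[2])
            for arg in node[2]:
                yield from macros(arg)
        elif node[0] == "group":
            yield from macros(node[1])


def islatex(src):
    try:
        parse(src)
        return True
    except ValueError:
        return False


def stexcheck(src, arity=ARITY):
    # Requires islatex, plus the right argument count for known macros.
    if not islatex(src):
        return False
    return all(n == arity[name]
               for name, n in macros(parse(src)[0]) if name in arity)
```

In this sketch, an input such as \foral{x} passes islatex but fails stexcheck, which mirrors the false positives of islatex described above, while \sseteq{A}{B fails both because of the missing closing bracket.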



All code and data relevant to this paper are available at https://gl.kwarc.info/dmueller/fifom.
Note that L_LaTeX and L_sTeX are not disjoint.
https://gl.mathhub.info/smglom
https://gl.mathhub.info/MiKoMH



Definition. Multiplication · computes the product a · b (also written as ab or a × b) of natural numbers a and b. It is defined by the equations x · 0 = 0 and x · S(y) = x + x · y.

Figure 1: An sTeX Example: The OMDOC corresponds to the symbolic expression x · 0 = 0

Consider the sentence from the SMGloM: Multiplication $\cdot$ computes the product $a\cdot b$ (also written as $ab$ or $a\times b$) of natural numbers $a$ and $b$.

Figure 3: Example Verbalizations for x : Nat and Generated Sentences

Results on our Evaluation Document. A detailed log file on our evaluation document with the individual results for each input and evaluation is available in the associated git repository.

ACKNOWLEDGMENTS

The first author and this work were supported by a postdoc fellowship of the German Academic Exchange Service (DAAD). The second author is supported by ERC starting grant no. 714034 SMART.

