FORMAL SPECIFICATIONS FROM NATURAL LANGUAGE

Abstract

We study the generalization abilities of language models when translating natural language into formal specifications with complex semantics. In particular, we fine-tune language models on three datasets consisting of English sentences and their corresponding formal representation: 1) regular expressions (regex), frequently used in programming and search; 2) first-order logic (FOL), commonly used in software verification and theorem proving; and 3) linear-time temporal logic (LTL), which forms the basis for industrial hardware specification languages. Our experiments show that, across these diverse domains, the language models retain their pre-trained knowledge of natural language and use it to generalize, e.g., to new variable names or operator descriptions. Additionally, they achieve competitive performance, and even outperform the state of the art for translating into regular expressions, with the benefits of being easy to access, efficient to fine-tune, and free of any particular need for domain-specific reasoning.

1. INTRODUCTION

Translating natural language into formal languages is a long-standing goal of artificial intelligence research dating back to the 1960s (e.g., Weizenbaum (1966); Winograd (1971)). Due to recent progress in deep learning (especially Vaswani et al. (2017)) and the development of language models (LMs), the field has seen significant improvements, for instance, in the translation from natural language into coding languages or formal mathematics (e.g., Lewkowycz et al. (2022); Chowdhery et al. (2022); Chen et al. (2021); Wu et al. (2022)). In this paper, we study the generalization abilities of a pre-trained LM when translating natural language into formal specification languages. Formal specification languages are used in various fields of computer science, including systems design, requirements analysis, and automated reasoning, to describe a system's desired behavior. Examples include specification languages based on logics, such as Alloy (Jackson, 2002) and LTL (Pnueli, 1977), system specification languages based on state charts, such as SDL (Fonseca i Casas et al., 2013), or text processing specifications based on regular languages, omega-regular languages, and automata theory (Aho, 1991; Thomas, 1990). Compared to natural language, the benefit of a formal specification language is its unambiguous semantics, which makes it accessible for algorithmic work that relies on a specification as input. Examples are high-performance SAT and SMT solvers (e.g., Sorensson & Een (2005); Biere et al. (2013); Audemard & Simon (2018); Moura & Bjørner (2008); Barrett et al. (2011)), planning tools (LaValle, 2006), model checkers (e.g., Cimatti et al. (2002); Holzmann (1997); Behrmann et al. (2006)), hardware synthesis tools (e.g., Bohy et al. (2012); Faymonville et al. (2017); Meyer et al. (2018)), or automatic theorem provers (e.g., Bertot & Castéran (2013); Nipkow et al. (2002)). Despite their benefits and various application areas, formal specification languages are still almost exclusively used by domain experts, as their application requires significant domain-specific knowledge and extensive manual work. With the success of LMs, the goal of making these techniques available to a broader user base, and thereby increasing the correctness, trust, and assurance in computer systems, is finally getting closer. So far, efforts to utilize deep learning for translating natural language into formal specifications have relied on training (often over-engineered) neural networks from scratch (e.g., Singh et al. (2020); He et al. (2022)). Such approaches are naturally limited in their generalization capabilities. Two natural questions arise: 1) Can off-the-shelf LMs achieve competitive performance when fine-tuned on this challenging translation task? 2) How well do they generalize with their pre-trained knowledge of natural language? In this work, we initiate a study on this topic by fine-tuning the open-source transformer language model T5 (Raffel et al., 2020). The transformer architecture (Vaswani et al., 2017) has proven to be the most powerful general-purpose model at the time of writing, setting new standards in many application domains, such as computer vision (e.g., Dosovitskiy et al. (2020)). Additionally, T5 is open-source, and the trained models are easily accessible to a broad audience.

We have picked three common yet diverse formal representations used widely in software and hardware domains: 1) regular expressions, frequently used in programming and text manipulation; 2) first-order logic, a standard formalism used in software domains such as theorem proving; and 3) linear-time temporal logic, used in hardware domains such as model checking of sequential circuits. Regular expressions (regex), introduced by Kleene et al. (1956), are sequences commonly used for text manipulation. For example, (a|b)* reads as "all sequences with no symbols other than a and b, including the empty string". First-order logic (FOL) extends propositional logic with predicates and quantification. With its foundations developed independently by Gottlob Frege and Charles Peirce (Peirce, 1933), FOL is a formal system of high importance in mathematics, computer science, and linguistics. For example, the formula ∀x.∃y.¬(x = y) denotes that for every x, there is a y that is not equal to x. Linear-time temporal logic (LTL) (Pnueli, 1977) is a hardware specification language widely used by the verification community. It forms the basis for industrial specification languages like the IEEE standard PSL (IEEE-Commission et al., 2005). LTL extends propositional logic with temporal operators that specify behavior over time. For example, when considering a controller for a shared resource, the formula □(r → ◇g) denotes that it is "always the case that a request r is eventually followed by a grant g".

Our experiments show that the fine-tuned LM achieves competitive performance on all tasks and even improves the state-of-the-art performance in translating natural language to regex by 6 percentage points. Additionally, the models can utilize pre-trained knowledge of natural language. For example, Figure 1 shows hand-picked in-distribution (ID) and out-of-distribution (OOD) examples for models trained on translating natural language to regex and LTL, respectively. The regex model generalizes to new nouns that were not present during fine-tuning. The LTL model was fine-tuned on "globally" and "always" as translations of the LTL operator □, on "implies" and "if then" as translations of the implication →, and on the variables i0 to i4 and o0 to o4. It generalized to new variable names and operator descriptions, recognizing x and o9 as variables, "whenever" as a synonym for "globally", and a simple comma as a synonym for "implies". We provide detailed experiments in Section 4 showing, for example, that the regex model achieves the same accuracy on a held-out test set (> 88%) when trained on only four of the 16 nouns occurring in the test set (cf. Figure 2 in Section 4).

In summary, we make the following contributions. We provide the first fine-tuned off-the-shelf language models for translating natural language into formal specifications, including a new state-of-the-art model for translating into regular expressions. We contribute two novel datasets for translating natural language into FOL and two for translating natural language into LTL.¹ Furthermore, we analyze the generalization capabilities of the pre-trained language models by conducting generalization experiments on new variables, nouns, and operator descriptions, as well as on out-of-distribution instances.

Figure 1: An ID example of a regex model trained solely on the noun "dog", tested OOD on new nouns "eye" and "time"; and an ID example of an LTL model trained on variables i0 to i4 and o0 to o4, tested OOD on new variables and operator descriptions (bottom). OOD fragments are highlighted.

¹ The datasets, models, and code will be published once the double-blind reviewing process ends.
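To make the regex semantics discussed above concrete, the following sketch (our own illustration using Python's standard re module, not part of the paper's tooling) checks the example (a|b)* against sample strings:

```python
import re

# The example regex from the text: all strings over {a, b}, including the empty string.
PATTERN = re.compile(r"(a|b)*")

def matches(s: str) -> bool:
    """Return True iff the whole string is matched by (a|b)*."""
    return PATTERN.fullmatch(s) is not None

print(matches(""))      # True: the empty string is included
print(matches("abba"))  # True: only symbols a and b occur
print(matches("abc"))   # False: 'c' is outside the alphabet
```

Note that fullmatch is used rather than match, since the reading "all sequences with no symbols other than a and b" requires the entire string to be consumed by the pattern.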

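The FOL example ∀x.∃y.¬(x = y) can likewise be illustrated by evaluating it over a finite domain; the sketch below (our own illustration, with a hypothetical helper name) mirrors the quantifier structure directly with Python's all and any built-ins:

```python
# Evaluate the example formula ∀x.∃y.¬(x = y) over a finite domain by enumeration:
# for every element x there must exist some element y with x != y.
def forall_exists_neq(domain) -> bool:
    """True iff every x in the domain has a distinct partner y."""
    return all(any(x != y for y in domain) for x in domain)

print(forall_exists_neq({1, 2, 3}))  # True: every element has a distinct partner
print(forall_exists_neq({1}))        # False: a singleton domain offers no such y
```

Of course, this enumeration only works for finite domains; in general, FOL satisfiability is handled by the theorem provers and SMT solvers cited above.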

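Finally, the semantics of the LTL example □(r → ◇g) can be sketched on finite traces. The evaluator below is our own simplified, finite-trace approximation (not the paper's code): at every position where a request r holds, a grant g must hold at that position or at some later one.

```python
# Finite-trace check of the LTL formula G(r -> F g):
# every position at which r holds must be followed (at that step or later) by g.
def always_request_implies_eventually_grant(trace) -> bool:
    """trace: list of sets of atomic propositions holding at each time step."""
    for i, step in enumerate(trace):
        if "r" in step:
            if not any("g" in later for later in trace[i:]):
                return False
    return True

ok_trace = [{"r"}, set(), {"g"}, {"r", "g"}]
bad_trace = [{"r"}, {"g"}, {"r"}]  # the second request is never granted

print(always_request_implies_eventually_grant(ok_trace))   # True
print(always_request_implies_eventually_grant(bad_trace))  # False
```

Full LTL is interpreted over infinite traces, so this finite-trace check is only an approximation for illustration; the model checkers cited above implement the complete semantics.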