Task 3: Systems Programming Languages

3 Task 3: Systems Programming Languages

Here our main focus is on abstractions used for real-world systems programming, looking at the C language (both sequential and concurrent aspects), runtime typechecking for systems code (in the libcrunch system), linking (specifically at the ELF format), CHERI C JavaScript, WebAssembly, and OCaml. We have also contributed to the CakeML verified ML compiler with the ISA modelling work described in Task 1.

3.1 Sequential C: Cerberus

The C language, originating around 1970, remains at the heart of a great deal of security-critical systems software, including operating systems, hypervisors, and embedded and Internet-of-Things devices – and many security flaws ultimately come down to C code. Ambiguities and differences of opinion about what the language is thus have considerable potential for harm – but it has never been well-defined. In earlier (pre-REMS) work, we collaborated closely with the ISO WG21 C++ and WG14 C committees to clarify their concurrency models, with our mathematical models (in prose form) incorporated into those international standards.

Our work on the sequential aspects of C semantics proceeds on several fronts: investigating the de facto standards of C as it is implemented and used; addressing and contributing to the ISO standard produced by the ISO WG14 C committee; and building an executable formal model for a substantial fragment of C, Cerberus [27, 28]. Our initial paper on this [27] was awarded a PLDI 2016 Distinguished Paper Award. We intend to release Cerberus under a permissive open-source licence in due course.

Cerberus web page https://www.cl.cam.ac.uk/users/pes20/cerberus/
Cerberus web interface: http://cerberus.cl.cam.ac.uk/cerberus

We are participating in the ISO WG14 C standards committee: we attended the April 2016 (London), October 2016 (Pittsburgh), April 2018 (Brno), and October 2018 (Pittsburgh) committee meetings, producing committee discussion papers ([29, 30, 31, 32], [33, 34, 35], [36, 37, 38, 39, 40], [41]), and instigated the WG14 C Memory Object Model Study Group; we are also engaging with the various compiler and developer communities, with talks at EuroLLVM 2018 and the GNU Tools Cauldron 2018, to Red Hat, and at 35c3.

3.2 CHERI C

At the C language level, just as at the architecture level, we have another strong collaboration with the CHERI work, with their experience in systems software (and specifically in porting it to the C supported by CHERI) informing our semantic understanding, and with this discussion identifying issues with the CHERI design. This made substantial contributions to the two Cerberus papers cited above [27, 28], and we have also contributed to work on the CHERI ABI, enforcing pointer provenance and minimising pointer privilege in a POSIX C environment [57].

We also developed a conceptual framework for application compartmentalization, a vulnerability mitigation technique employed in programs such as OpenSSH and the Chromium web browser, which decomposes software into isolated components to limit privileges leaked or otherwise available to attackers. This is embodied in an LLVM-based tool: the Security-Oriented Analysis of Application Programs (SOAAP), that allows programmers to reason about compartmentalization using source-code annotations (compartmentalization hypotheses) [56].

3.3 C runtime type checking: libcrunch

We are developing a system for run-time type checking in C, libcrunch, extending it to additionally check array/pointer bounds [51, 52, 53]. Using run-time type information, we have found it possible to tolerate real C idioms not supportable by software fat-pointer schemes. Early indications suggest that the available run-time performance is comparable or better than competing approaches.

We have continued to refine the run-time type infrastructure underlying libcrunch into a more general-purpose infrastructure with a variety of other applications [55] including debugging, interfacing native code with dynamic languages, and dealing with memory-mapped files.

Libcrunch github repo: https://github.com/stephenrkell/libcrunch/

3.4 ELF linking: linksem

Systems software in reality is written not just in more-or-less conventional programming languages, like C and C++, but also using a complex menagerie of techniques to control the linking process. This linker speak is often poorly understood by working programmers and has largely been neglected by language researchers. We have surveyed the many use-cases that linkers support and developed the first validated formalisation of a realistic executable and linkable format, ELF, together with an executable specification of linking with substantial coverage [54]. This also incorporates a semantics for DWARF debug information. The model is written in Lem, which generates an executable OCaml version. This ELF model is used in our Task 2 RMEM architectural emulation tool. The model is available under a permissive open-source licence.

ELF semantics github repo: https://github.com/rems-project/linksem.

3.5 Concurrent C/C++

Turning to the concurrency semantics of C-based languages, in earlier (pre-REMS) work, we collaborated closely with the ISO WG21 C++ and WG14 C committees to clarify their concurrency models, with our mathematical models (in prose form) incorporated into those ISO C/C++11 international standards. (Batty et al., POPL 2011). We have built on those, and also considered potential future models that would fix the so-called ‘thin-air problem’ that C/C++11 suffers from.

Batty’s 2014 Cambridge PhD thesis The C11 and C++11 Concurrency Model was awarded the 2015 ACM SIGPLAN John C. Reynolds Doctoral Dissertation Award, presented annually to the author of the “outstanding doctoral dissertation in the area of Programming Languages”, and was the winner of the 2015 CPHC/BCS Distinguished Dissertations competition, selected as “the best British PhD/DPhil dissertations in computer science”. The WG14 C committee also approved his remaining defect reports w.r.t. the C11 concurrency model.

We contributed to work in LLVM implementing the C/C++11 concurrency model [49, 50].

We developed a C11 operational semantics that is proved equivalent (in Isabelle/HOL) to the Batty et al. axiomatic model [43]. This was also integrated into our Cerberus sequential C model to allow litmus tests to be executed, though that integration is not currently maintained, to simplify Cerberus development.

We showed that no straightforward adaptation of the C/C++11 axiomatic model, phrased as a consistency predicate over candidate executions, can exclude thin-air reads while also admitting some standard optimisations, identified additional difficulties with undefined behaviour and with mixtures of atomic and nonatomic accesses. We also proved (in HOL4) that the C/C++11 model actually provides the DRF-SC guarantee [45].

We proposed a model that admits compiler optimisation while avoiding the thin-air problem [44, 47].

Along with the extension of our ARM and IBM POWER hardware models to support mixed-size accesses, described above, we proposed a corresponding extension to the C/C++11 axiomatic model to support mixed-size non-atomic accesses [8].

We discussed a new problem with C/C++11, uncovered by groups in Princeton and MPI-SWS/SNU, that contradicts the soundness of compilation schemes from C/C++11 to ARM and POWER (pre-REMS, we had published hand proofs of those that now turn out to be flawed); this required additional changes to C/C++11, developed by the latter group.

We proposed a safe, portable, and efficient implementation of co-routines [48].

3.6 Verified ML implementation: CakeML

We continue to contribute to the CakeML development of a verified ML implementation, especially w.r.t. the backend ISA models and proof, as discussed in Task 2.

3.7 JavaScript

We have developed JSExplain [60], a reference interpreter for JavaScript that closely follows the specification and that produces execution traces. These traces may be interactively investigated in a browser, with an interface that displays not only the code and the state of the interpreter, but also the code and the state of the interpreted program. In that respect, JSExplain is a double-debugger for the specification of JavaScript, useful both for standard developers and JavaScript programmers.

We have developed a system for logic-based verification of JavaScript programs [61] and a symbolic evaluation system for JavaScript [58].

We have developed JaVerT [59, 63], a semi-automatic JavaScript Verification Toolchain, based on separation logic and aimed at the specialist developer wanting rich, mechanically verified specifications of critical JavaScript code. JaVerT is the first verification tool based on separation logic that targets dynamic languages. To specify JavaScript code, we provide abstractions that capture its key heap structures (for example, prototype chains and function closures), allowing the developer to write clear and succinct specifications with minimal knowledge of the JavaScript internals. To verify JavaScript programs, we develop a trusted verification pipeline consisting of: JS-2-JSIL, a well-tested compiler from JavaScript to JSIL, an intermediate goto language capturing the fundamental dynamic features of JavaScript; JSIL Verify, a semi-automatic verification tool based on a sound JSIL separation logic; and verified axiomatic specifications of the JavaScript internal functions. JaVerT has so far been used to verify functional correctness properties of: data-structure libraries (key-value map, priority queue) written in an object-oriented style; operations on data structures such as binary search trees (BSTs) and lists; examples illustrating function closures; and test cases from the official ECMAScript test suite. The verification times suggest that reasoning about larger, more complex code using JaVerT is feasible.

This work has been presented in keynotes in CADE 2017 and FSEN 2017, and at an invited talk at Emerging Technologies, the 2017 New York ECMAScript meeting.

JavaScript semantics web page: http://psvg.doc.ic.ac.uk/research/javascript.html
JavaScript semantics (jscert) repo: https://github.com/jscert/jscert

3.8 WebAssembly

WebAssembly is a new low-level language currently being implemented in all major web browsers. It is designed to become the universal compilation target for the web, obsoleting previous solutions such as asm.js and Native Client. We have developed a mechanised Isabelle specification for WebAssembly, including a verified executable interpreter and type-checker, and a mechanised proof of type soundness [68]. This exposed several issues with the official WebAssembly specification and influenced its development. We have also worked on constant-time computation in WebAssembly [69].

WebAssembly AFP Isabelle development: https://www.isa-afp.org/entries/WebAssembly.html

3.9 OCaml concurrency and infrastructure

Working with our OCaml Labs colleagues, we have proposed a concurrency semantics for OCaml [70] and contributed to various OCaml infrastructure: a generic approach to defining partially-static data and corresponding operations [72]; conex [73], a tool which allows developers to cryptographically sign their released software, and users to verify them, ensuring integrity and authenticity without a centralised trust authority; and a modular foreign function interface for OCaml [71].