Rambles around computer science

Diverting trains of thought, wasting precious time

Mon, 29 Sep 2014

Progress by distillation

A theme of Joe Armstrong's Strange Loop keynote was that creating stuff doesn't necessarily make progress overall. Proliferation reduces value. Entropy is our enemy. He mentioned the bewildering array of build systems, packaging systems, unidentifiable system software (I never did catch what “grunt” is) and, more generally, copies of the same or similar code and data. Building a “compressor” was his likeably vague solution.

My talk arguably continued this theme. I was talking about proliferation of dynamic languages—in contrast to the original Smalltalk vision, which didn't envisage the host of broadly similar dynamic languages which now coexist alongside it.

An underrated way to make progress is by distillation: recognising repetition, and integrating and unifying existing systems where possible. We don't have to do so perfectly or completely. It doesn't have to be a Grand Unifying Design. We can work opportunistically. But we need both the right mindset and the right infrastructure. Tools for creating are not enough. Tools for integrating and distilling are necessary.

I seem to have written about this before. It's nice to see the same ideas cropping up in a keynote!


Project ideas 2014–2015

I maintain a brief list of ideas for student projects which I'm always willing to supervise. Here I'll elaborate on some that I'm particularly keen on this year. As always: these are just a selection. I'm very interested in having students discuss their own ideas with me, or suggesting variations on what I have proposed. Please contact me if any of these ideas (or those on the linked page) is of interest. Most of these ideas could make Bachelor's or Master's projects, with appropriate tailoring which we would discuss.

Program slicing at debug time

Program slicing is a powerful program transformation. It seeks to create a minimal program that explains some strict subset of a program's state at a particular point in an execution, throwing away code that did not influence that subset. For example, suppose I'm debugging the following (very stupid) program.

y = 12;
z = y * foo();
x = y + 1;

I might see, when debugging, that after the third line, variable x has the value 13. I want to know how it got that value. A program slicer could show me a minimised version of my program that eliminated any statements that did not contribute to the value of x. Here, that's the middle statement (probably! that is, provided foo() did not change y somehow).
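
Concretely, if foo() cannot have changed y, the slice explaining x is just the first and third statements:

y = 12;
x = y + 1;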

Although Mark Weiser's original paper emphasises the fact that programmers do program slicing in their heads when debugging, automated slicing still hasn't entered the usual repertoire of features that programmers find in their debuggers. This project will implement and evaluate a slicer that is designed to be usable from a debugger. The evaluation is likely to be a size comparison against an existing slicer—smaller slices are usually better. We can also measure the speed of the slicing algorithm, since to be usable interactively in a debugger it shouldn't be too slow.

A simple slicer is not too much work. There are also many directions for extensions. Precise slicing is undecidable in general, so all approaches are approximate. Better approximation through cleverer program analysis is the obvious avenue. Another is amorphous slicing, which explores semantics-preserving changes to the program's syntactic structure, in order to produce a still smaller program. A third is to use knowledge of the program state itself as an extra constraint on slicing—not just the position we want to slice at, but also the values of program variables. Slicing can be seen as a kind of backwards program execution, with a user-customisable degree of concretisation introduced. The “most amorphous”, “most concrete” slice is simply the relevant subset of the (branch-free) trace of program execution (if it's unique; else the union of all possible traces leading to that state). This might not be as useful as a slice preserving slightly more of the program's execution.

There is also an incidental problem that will need treating somehow: most debugging infrastructure doesn't directly let you reconstruct the original source code that you're debugging (at least in the cases of native Unix-style debugging and Java debugging). Instead it enumerates the source files that went into each compilation unit, together with the line number ranges of each, and provides a shallow mapping of program counter values onto these files and lines. In Java this is very nearly enough; for C, it's less good, because the preprocessing step is not explicitly described. So, there might be some insight to be had about how source files ought to be represented within debugging information. One extension might implement such an improvement, by judicious hacking on a compiler (most likely LLVM) to modify and/or supplement the debugging information that it currently outputs.
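
To make the “shallow mapping” concrete, here is a minimal sketch of reading it, using elfutils' libdw (one possible library choice, not a project requirement). It walks the DWARF line table of each compilation unit and prints the program-counter-to-source mapping; error handling is elided. Build with -ldw.

#include <stdio.h>
#include <fcntl.h>
#include <elfutils/libdw.h>

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDONLY);
    Dwarf *dbg = dwarf_begin(fd, DWARF_C_READ);
    Dwarf_Off off = 0, next;
    size_t header_size;
    /* iterate over compilation units */
    while (dwarf_nextcu(dbg, off, &next, &header_size, NULL, NULL, NULL) == 0)
    {
        Dwarf_Die cudie;
        if (dwarf_offdie(dbg, off + header_size, &cudie))
        {
            Dwarf_Lines *lines;
            size_t nlines;
            if (dwarf_getsrclines(&cudie, &lines, &nlines) == 0)
            {
                for (size_t i = 0; i < nlines; ++i)
                {
                    Dwarf_Line *line = dwarf_onesrcline(lines, i);
                    Dwarf_Addr addr;
                    int lineno;
                    dwarf_lineaddr(line, &addr);
                    dwarf_lineno(line, &lineno);
                    /* the shallow mapping: a PC, a file name, a line number; nothing deeper */
                    printf("%#llx %s:%d\n", (unsigned long long) addr,
                        dwarf_linesrc(line, NULL, NULL), lineno);
                }
            }
        }
        off = next;
    }
    dwarf_end(dbg);
    return 0;
}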

A garbage collector using liballocs

My work on liballocs is developing a run-time library that dynamically tracks allocations and their data types across (potentially) a whole process. This means you can ask it what's on the end of a pointer, and it can usually tell you—even for pointers to stack or heap memory, interior pointers, and so on. It can be thought of as implementing a reflection API, as a platform for building dynamic program analyses, and as a platform for implementing dynamic languages (see my talk!).
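
The sort of query this enables looks roughly like the following. This is only a sketch: the entry-point names and signatures are my paraphrase of liballocs's interface, and uniqtype_name() is a hypothetical helper, since real uniqtypes are structured metadata rather than strings.

#include <stdio.h>

/* paraphrased liballocs-style entry points; treat these declarations as assumptions */
struct uniqtype;                                        /* run-time type descriptor */
struct uniqtype *__liballocs_get_alloc_type(void *obj); /* what type is at *obj? */
void *__liballocs_get_alloc_base(void *obj);            /* where does the allocation start? */
unsigned long __liballocs_get_alloc_size(void *obj);    /* how big is it? */
const char *uniqtype_name(struct uniqtype *t);          /* hypothetical pretty-printer */

void describe(void *p)
{
    struct uniqtype *t = __liballocs_get_alloc_type(p);
    if (t) printf("%p points into a %s of %lu bytes starting at %p\n",
            p, uniqtype_name(t),
            __liballocs_get_alloc_size(p), __liballocs_get_alloc_base(p));
    else printf("%p: no metadata known\n", p);
}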

This project is about implementing a mark-and-sweep garbage collector on top of liballocs. This can be thought of as a replacement for traditional conservative collectors like the Boehm collector. If liballocs's view of the process is completely accurate, and no pointers are computed in a way that we didn't anticipate, then we obtain a precise collector. To deal with unusual address computations or tricky encodings of pointers, we might nevertheless need to build in some conservatism.

In any case, the liballocs API allows a collector to be more “federated” in its design. Traditional collectors are owned by a single virtual machine (VM) which makes strong assumptions about how objects are laid out, what addresses are issued, what parts of the process can point into “their” heap (the “roots” of tracing), and so on. With liballocs, a collector can make a reasonable attempt at tracing native data structures directly, without knowing in advance how they are laid out—the library returns metadata describing the layouts. This allows tracking reachability over paths that stray outside a single region and then reach back in—for example, from the collected heap into the malloc() heap and back.
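
A mark phase under this design might look like the following sketch. The helper walk_pointer_fields() is assumed, standing for a loop over the field offsets that the returned layout metadata describes; the mark bits are the collector's own.

#include <stdbool.h>

struct uniqtype;                                        /* liballocs type descriptor */
struct uniqtype *__liballocs_get_alloc_type(void *obj); /* paraphrased entry point */

/* assumed helper: invoke cb on the target of each pointer-typed field
   of the allocation at base, as described by its layout metadata t */
void walk_pointer_fields(void *base, struct uniqtype *t, void (*cb)(void *target));

bool test_and_set_mark(void *base);                     /* the collector's own mark bits */

void mark(void *obj)
{
    struct uniqtype *t = __liballocs_get_alloc_type(obj);
    if (!t) return;        /* unknown memory: a conservative fallback would go here */
    if (test_and_set_mark(obj)) return;                 /* already visited */
    /* recurse into whatever the metadata says are pointers, whether the block
       lives in our collected region, the malloc() heap, a stack frame, ... */
    walk_pointer_fields(obj, t, mark);
}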

This “federated” design is potentially (i.e. in future work!) an enabler of reasonably seamless programming in a mix of languages, or when mixing manual with automatic storage management. For example, interfaces like JNI force the programmer to explicitly signal to the collector when native code takes a reference to a Java object, by creating a “global reference” (which is managed manually). With liballocs, the collector can simply trace the native data structures too.

Although liballocs can usually tell you what data type is instantiated on the end of a pointer, there are a few exceptions to this (such as on-stack temporaries), and this isn't quite all the information we need. For collectors, it matters whether storage is initialised (hence meaningful) or not, and whether a value is live or dead. It also matters what funky address computations a program might do (all the way up to extreme cases like XORed pointers!). A large part of this project will involve finding solutions to these problems, as well as gathering an understanding of when they do and don't matter in practice. Most of these issues have “big hammer” solutions which work fairly well but are slow (like replacing malloc() with calloc() to avoid uninitialised heap data) and also more clever solutions (like skipping this if we can prove that an allocation site always initialises the whole heap block before the next safepoint). So there is plenty of scope to pick extensions to the core project.
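
For flavour, the “big hammer” for uninitialised heap data can be as blunt as an LD_PRELOADed malloc() that zero-fills every block, a standard Unix interposition trick (sketch for Linux; real interposers must also cope with dlsym() itself allocating, which is glossed over here).

/* zeromalloc.c: gcc -shared -fPIC -o zeromalloc.so zeromalloc.c -ldl
   then run the target with LD_PRELOAD=./zeromalloc.so */
#define _GNU_SOURCE
#include <stddef.h>
#include <dlfcn.h>

void *malloc(size_t size)
{
    static void *(*real_calloc)(size_t, size_t);
    if (!real_calloc) real_calloc = dlsym(RTLD_NEXT, "calloc");
    /* zero-fill, so a scanning collector never reads junk that looks like a pointer */
    return real_calloc(1, size);
}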

Obvious comparisons for evaluation are the Boehm GC, and also perhaps the garbage collectors of the Go or D languages. We can compare in both speed and precision (or, inversely, conservatism). The Boehm GC is well known for having observable conservatism, keeping objects alive when they are not reachable, because of integers that alias pointers. We would expect to do better than Boehm for precision, while retaining comparable performance (within a factor of 2 for total memory management time overhead, say). An MPhil project would set more ambitious goals and use more advanced garbage collection algorithms (e.g. a generational and/or compacting collector).
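
The classic failure mode is easy to reproduce: an integer whose bit pattern happens to equal a heap address pins that block under purely conservative scanning.

#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    void *block = malloc(1 << 20);               /* a megabyte we now "leak"... */
    uintptr_t lucky = (uintptr_t) block;         /* ...but its address survives as an integer */
    block = NULL;
    /* a conservative collector scanning the stack sees lucky, cannot tell
       integer from pointer, and keeps the megabyte alive; a precise collector
       knows lucky's type and can reclaim the block */
    (void) lucky;
    return 0;
}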

A bounds checker in libcrunch

My work on libcrunch uses liballocs to implement run-time type checking for C (and other languages, like C++ and Fortran, in due course). The basic idea is that whenever a pointer cast occurs, we check that the cast-to type “matches” what's on the end of the pointer (for a sufficiently refined notion of match). One weakness of libcrunch is that it assumes the program is memory-safe, both spatially and temporally. Put differently, it will catch type errors caused by bad cast logic (analogous to a ClassCastException in Java) but not those occurring as a consequence of buffer overflows (“spatial” errors), nor use-after-free or use-before-initialise behaviours (“temporal” errors). This project will implement spatial correctness checking that integrates with libcrunch.
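
To make the basic idea concrete, the checks fire at casts like the one below. The struct is invented for illustration; in reality libcrunch's checks are inserted by compile-time instrumentation rather than written by hand.

#include <stdio.h>

struct widget { int id; double weight; };       /* an invented example type */

/* a callback taking an opaque pointer, as C APIs often do */
void handle(void *arg)
{
    /* at this cast, libcrunch's instrumentation checks that the allocation
       behind arg really matches struct widget; a bad registration elsewhere
       surfaces here, much like a ClassCastException would in Java */
    struct widget *w = (struct widget *) arg;
    printf("widget %d weighs %f\n", w->id, w->weight);
}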

Unlike conventional spatial bounds checkers such as SoftBound or ASan, we have type information (from liballocs), so we can do better in various ways. Most significantly, there is no need for per-pointer metadata; per-object type metadata is enough (this holds for C; ask me for the detailed reason why!). This means we don't need to move metadata around as we move pointers around, so we should usually see some performance wins. It also doesn't matter if we haven't instrumented all code: as long as we observe allocations, we will have the necessary metadata. However, there are some drawbacks: querying pointer metadata will involve a short search, via the containing object's metadata, rather than a direct-mapped lookup as in SoftBound. The main target of the project should therefore be to go roughly as fast as SoftBound. Lower memory overhead and lower virtual address space usage are also likely benefits of this approach; these can be measured.
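
The kind of error in scope looks like the following. Note that catching it needs only the bounds of the containing object, which per-object metadata already records; no per-pointer state is required.

#include <stdlib.h>

int main(void)
{
    int *arr = malloc(10 * sizeof (int));
    /* spatial error: one past the last valid index; per-object metadata
       says this block holds 10 ints, so a checker can flag the access
       without tracking the pointer arr itself */
    arr[10] = 42;
    free(arr);
    return 0;
}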

System call interposition, specification and emulation

To understand the precise semantics of a user-level binary program running atop a commodity operating system, we must understand not only the instructions it executes but also the system calls it makes. Currently, we are making progress on formally specifying instruction set architectures, but the system call interface remains surprisingly murky. What system calls may a program make? What, precisely, are the valid arguments to each system call, and what do they mean? Given some arguments, what memory might a given system call touch? What other effects might it have on the program state?

This project will build a toolkit for observing, specifying, intercepting and emulating system calls as made by real, large programs. As a starting point, we have a basic (rather limited) preexisting infrastructure for trapping system calls in an x86-64 Linux program, and a sketch of a domain-specific language (not set in stone) for describing a system call interface. The project will produce a usable end-to-end system for intercepting, specifying and emulating system calls. In a basic usage, we would divert the calls back to the underlying operating system and observe their handling (e.g. using SystemTap or DTrace) to check the accuracy of our model. In a more advanced usage we would instead be running the program in an emulator, and would update the emulator's state to reflect the effect of the system call. The intention is to have this work for various pairings of OS kernel (Linux, FreeBSD) and instruction set architecture (Intel, Power, MIPS, ARM), although at least initially, some pairings are more important than others.
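
For flavour, here is a minimal syscall observer in the style of strace, using ptrace on x86-64 Linux. Our existing trapping infrastructure works differently, but this shows the kind of interception involved (error handling elided; run as, say, ./mini-strace /bin/ls).

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    pid_t child = fork();
    if (child == 0)
    {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);       /* let our parent trace us */
        execvp(argv[1], &argv[1]);
        return 1;                                    /* exec failed */
    }
    int status;
    waitpid(child, &status, 0);                      /* child stops at exec */
    while (!WIFEXITED(status))
    {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to syscall entry */
        waitpid(child, &status, 0);
        if (WIFEXITED(status)) break;
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        fprintf(stderr, "syscall %lld\n", (long long) regs.orig_rax);
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to syscall exit */
        waitpid(child, &status, 0);
    }
    return 0;
}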

This project is very researchy, and has both practical and theoretical aspects. It's motivated by the research we're doing in the REMS project.

A linker, mostly functionally

Much like compilers, linkers have a crucial influence on the meaning of the eventual binary program. Current linkers are big piles of imperative code, with little explicit specification. This project concerns building either a static or dynamic linker for ELF binaries in a mostly functional style, in a way that will yield a clear specification of the various parts (relocation, address assignment, section concatenation or merging, etc.). Ideally the linker would be written as an “executable specification”, likely in the Lem language, making explicit the points where the written specification (such as it is) leaves room for nondeterministic choice.

Either a static or dynamic linker can be attempted; in the dynamic case, we have a very basic skeleton already. It won't be feasible to implement a fully-featured linker, but we will carve out some subset depending on interests.

An optimising linker

Traditionally, linkers do not understand the instructions in a program. Instead, they blindly patch binary code in predefined ways, such as “add n bytes to the four-byte signed integer at this address” (which happens to be the relative-address field inside a call instruction, say).
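
That patching step is, at heart, a tiny function of three values: the symbol's address S, an addend A and the address P of the field being patched. Here it is for the x86-64 relocation R_X86_64_PC32 (the relative call case just mentioned), whose value the ABI defines as S + A - P; the overflow check a real linker needs is omitted.

#include <stdint.h>
#include <string.h>

/* apply R_X86_64_PC32: write S + A - P into the four-byte field at 'field' */
static void apply_pc32(unsigned char *field,  /* the field's location in our image */
                       uint64_t S,            /* the symbol's run-time address */
                       int64_t A,             /* the addend, e.g. -4 for a call */
                       uint64_t P)            /* the field's run-time address */
{
    uint32_t value = (uint32_t) (S + A - P);
    memcpy(field, &value, sizeof value);      /* blind patching: no instruction decoding */
}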

However, a linker is potentially a very good place to do optimisation, because interprocedural flows are necessarily visible to it. A dynamic linker is potentially a very good place to do dynamic optimisation, because it can observe code being loaded and unloaded, hence can perform optimisations speculatively yet safely. So, there is a case that linkers should understand instruction streams.

Current toolchain support for link-time optimisation (LTO), in both gcc and llvm, is limited by working on the toolchain's intermediate representation. This must somehow be propagated into the input binaries—the binaries must be “compiled for” link-time optimisation. An alternative approach is to bite the bullet and teach the linker about instruction streams, so that it can disassemble and re-optimise the instructions directly, likely using debugging information as a source of type information where this is helpful.

Some other interesting applications of link-time instruction stream rewriting include whole-program instrumentation (e.g. to intercept certain procedure calls, system calls, or certain memory accesses, such as a generational garbage collector's write barrier), reference widening (to overcome the complexity of code models) and speculative dynamic optimisation (e.g. to do “devirtualisation” of indirect call instructions). One of these could perhaps be addressed as an extension.

The chief evaluation will measure performance improvements. There are also benefits in binary size, relative to traditional LTO toolchains, which can be measured too.

[Update: I added another project suggestion in a separate post.]


