Rambles around computer science

Diverting trains of thought, wasting precious time

Tue, 14 Feb 2017

Custom ELF program headers—what, why and how

A couple of times recently I've found myself wanting to create ELF files with custom program headers. This post explains what that means, why you might want it, why there's a practical difficulty in doing so using a standard linker, and my slightly hacky way of solving it.

Program headers are metadata within ELF files. They describe a contiguous chunk of the file's address space. There are several different kinds of program header, of which the most critical is LOAD, signifying that the relevant part of the file should be loaded (or mapped) into memory, forming a segment within the program's memory image. Besides LOAD, other program headers are used to attach particular meanings to specific ranges within the binary. For example, the DYNAMIC program header tells the loader where the dynamic-linking metadata resides. The INTERP program header identifies to the (in-kernel) loader a string naming the program interpreter (the dynamic linker, which will actually do the loading proper). Some vendor extensions exist for adding features within the loading process, such as GNU_RELRO (which records a range that can be made read-only after relocation, bringing security benefits) or GNU_STACK (which does not identify an associated range, but exists as a flag to tell the loader whether the stack needs to be made executable). If you run readelf -l on some ELF binary, you'll see a dump of its program headers. On my machine I see something like this.

$ readelf -Wl /bin/true
Elf file type is EXEC (Executable file)
Entry point 0x401432
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0058b4 0x0058b4 R E 0x200000
  LOAD           0x005e10 0x0000000000605e10 0x0000000000605e10 0x000404 0x0005f0 RW  0x200000
  DYNAMIC        0x005e28 0x0000000000605e28 0x0000000000605e28 0x0001d0 0x0001d0 RW  0x8
  NOTE           0x000254 0x0000000000400254 0x0000000000400254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x004b4c 0x0000000000404b4c 0x0000000000404b4c 0x00023c 0x00023c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x005e10 0x0000000000605e10 0x0000000000605e10 0x0001f0 0x0001f0 R   0x1

Two reasons for wanting to add custom program headers are to add custom features to the loading process, or to make new uses of features it already provides. My first use case was an instance of the latter: to create a heap area at load time, as an additional segment, rather than the usual way of making explicit syscalls during execution. I was building ELF files for the ppcmem emulated execution environment. This doesn't implement any system calls, but does implement an ELF loader. By adding an extra LOAD program header providing an allocation arena, I could write a malloc() that does not require system calls—its heap arena is mapped at start-up by the ELF loader inside ppcmem. (Although I could have allocated space using a big char array, or by hacking the linker script to leave a big gap, a separate segment is logically cleaner, and a disjoint address range for heap allocations is helpful.)

The second use case for custom program headers comes from liballocs. When passed through the liballocs toolchain, a binary also generates extra metadata binaries which can optionally be loaded at run time to supply extra metadata similar to debugging information. Currently these files live outside the main binary, in a separate /usr/lib/allocsites hierarchy, which is a pain for deployment: if you move or copy an executable, you must move or copy its metadata as an extra step. I'd like to bundle the metadata into the binary somehow, and record its presence using a custom program header.

The way to specify the program headers is in the linker script. The problem with creating custom program headers is that a GNU-style linker's script language requires you to specify the entire set of program headers if you specify any program header at all. You can't just add one extra. Worse, the default set of headers is not even made explicit anywhere. The PHDRS command is what specifies the program headers, but the default linker script (try ld --verbose to see it) doesn't use it. Rather, it falls back on the linker's hard-coded default behaviour, which is to create all the familiar headers that we see above.

If we write our own PHDRS command we can create all the program headers we want—but we have to create them all ourselves, meaning manually creating all the ones that the linker would otherwise create for us. The set of headers created by hard-coded logic in the linker has grown over the years, to include things such as the GNU_STACK and GNU_RELRO headers I mentioned earlier. Re-creating this “standard” set of program headers is therefore quite a detailed platform-specific task, and is subject to change as new features are added; a burden we don't want to take on.

So, I came up with a way of getting the effect we want: “please create this extra program header, in addition to all the normal ones”. It is not pretty and not perfect, but it works as follows. Run the linker once, without any custom program header directives. Remember what program headers the linker generated, and what sections it assigned to them. Dump the linker script used during that link, and annotate it so that each output section is labelled with its program header assignment. Then, apply a diff to the linker script adding only the program header we care about. Finally, re-link with the diffed script.

The obvious downside is that we link twice. We're also a bit fragile to the organisation of the default linker script. Still, I'm prepared to pay these prices. I've implemented this approach with a pleasingly nasty (but small) pile of shell scripts, make rules and m4 macros. It now works pretty well for me. Here's a brief tour.

First, we have some makerules to dump the default linker script.

default-dynexe.lds: $(shell which ld)
    $(LD) --verbose 2>&1 |\
      sed -e '/^=========/,/^=========/!d;/^=========/d'\
          -e 's/\. = .* + SIZEOF_HEADERS;/& _begin = . - SIZEOF_HEADERS;/'\
      > "$@"

(Thanks to Dan Williams for the sed script, which I took from his page about creating a custom dynamic loader, which is well worth a read.)

Next, we have some makerules to turn such a script into an m4'ised macro-script. Essentially, this macroisation adds two “hooks” into the script: one to include a file describing the “normal” program headers and insert any new ones; and one to wrap each output section in a macro invocation that will decorate it with its corresponding program header names. This macroisation rewrite is easily done with regular expressions.

%-dynexe.lds.phdrify.m4: %-dynexe.lds
    cat "$<" | sed 's#SECTIONS#PHDRS\n{\n\tinclude(phdrs_dynexe.inc)\n\tinclude($(srcdir)/phdrs_extra.inc)\n}\nSECTIONS#' | \
    tr '\n' '\f' | sed -r \
    's@(\.[-a-zA-Z0-9\._]+)[[:space:]\f]+(([^\{]*[[:space:]\f]*)?\{[^\}]*\})@expand_phdr([\1], [\2])@g' | \
    tr '\f' '\n' > "$@"

Next, our phdrified linker script contains output section blocks that used to look like this

  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }

now wrapped in macro invocations, like this.

  expand_phdr([.rodata], [: { *(.rodata .rodata.* .gnu.linkonce.r.*) }])

Once we have a macroised linker script, we need to m4-process it with some macro definitions that turn it back into the script we want. We need two kinds of definition for this. The first is a list of the program headers in the file as it was originally linked. The easiest way is to use readelf to dump the existing ones. As a wart: since (on my machine at least) the linker doesn't understand the full range of symbolic (named) values for the program header's type field, we use the C preprocessor to macro-expand the symbolic names down to hex literals (as defined in elf.h), which it does understand. The second kind of definition is going to expand each output section to explicitly mention which program headers it belongs to. Note that a section can be in more than one program header. I used a bash script with associative arrays, again operating over readelf output, which handily outputs the mapping from segments to section lists. The script inverts this mapping, into a section-to-segment-list mapping. This generates a single m4 macro whose body is a giant if-else chain testing the argument section name against all the sections in the file, and yielding the list of program headers that the section is in. The easiest way is to use readelf to dump the existing ones. The result is a chunk of m4-ified text that describes the program headers in the original link, looking something like this.

phdr0 6;
phdr1 3;
phdr2 1;
phdr3 1;
phdr4 2;
phdr5 4;
phdr6 0x6474e550;
phdr7 0x6474e551;
phdr8 0x6474e552;

After all that. we then simply m4-include a second file describing the custom ones, and use m4 to expand out the program header assignments. Our output section before has now become

  .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) } :phdr2

i.e. recording that the read-only data is in program header named phdr2 (the first LOAD in the dump we saw early on; this is the text segment, which also holds read-only data).

When we re-link with this script, we can add whatever extra linker inputs we want to go into our custom segment, and also apply a diff to the linker script. A simple diff looks like this, where the stuff in sections whose names begin  .insert will end up in a new segment.

--- default-dynexe.lds  2016-09-14 23:12:31.000000000 +0100
+++ insert-dynexe.lds   2016-09-14 23:13:46.000000000 +0100
@@ -187,6 +187,9 @@
   . = ALIGN(64 / 8);
   _end = .; PROVIDE (end = .);
   . = DATA_SEGMENT_END (.);
+  /* srk: new segment! */
+  .insert  : ALIGN(0x1000) { *(SORT_BY_NAME(.insert*)) } :insert
   /* Stabs debugging sections.  */
   .stab          0 : { *(.stab) }
   .stabstr       0 : { *(.stabstr) }

The .insert output section will also be aligned to a 4kB boundary, and be placed in the program header called “insert” by the script. If the input section is allocatable, it will also go into a LOAD header. By varying this patch, you can put whatever section you like in your program header, aligned however you like.

There are some warts in all this. One is that I found increasing the number of program headers would confuse the GNU linker into placing some of its implicitly generated sections, .gnu.build-id and .note.ABI-tag, at a silly offset in the file, and that would mess up the address assignment, by causing the file header no longer to be mapped at virtual address 0. To hack around this, I inserted a one-line sed command which explicitly places the .note.ABI-tag section. Another wart was that the PHDR header was no longer generated properly: it was coming out with an address of 0, not of the executable's start address (0x400000). This turned out to be because no LOAD header was including the region of the file containing the file and program headers themselves. To hack around this, we add a little more rewriting magic to our shell scripts, to put the PHDR and FILEHDR linker attributes to any program header entry with type PT_PHDR, and also to the text segment's LOAD program header (usually the first), then updating the latter's file offset and memory addresses so that the segment actually spans that part of the file.

I mentioned that I wanted to use custom program headers to embed one ELF file (a metadata object) inside another. How can we do this? Firstly, we embed the metadata ELF file into the enclosing ELF file and mark it with a custom program header. Secondly, at run time, when we want to load the metadata, we locate the program header and get a file descriptor pointing at the embdded content. Then we need a dynamic linker that can dlopen() from a file descriptor; we can hack the existing GNU dynamic linker for this fairly easily (not shown, but I've done it; ask me). Finally though, we need to handle the fact that file offsets in the embedded content will not match file offsets on this file descriptor! So we need to translate all the ELF file offsets in the embedded file by a fixed amount. File offsets live in the ELF header, program headers and section headers. (They might also live in other sections, but I'm not aware of any.) I wrote a simple C program to add a fixed offset to these. There's also some guesswork needed to figure out what that offset should be, since the linker will be in control of the in-file placement of the extra output section. This could be made reliable using one more link pass (bringing the total to three).

There's a GitHub repository containing these recipes and examples, here. It's all a bit GNU-specific and underdocumented for now.

All this effort to achieve a fairly simple task: package one ELF object inside another so that the “inner” object is loadable when running the “outer” object. What's another way of looking at this problem? We can think of ELF files as describing an initial state of the program (or, in the case of a library, a fragment of state). What we'd like them to be able to do is specify that this state includes a given file of a particular name and contents. In other words, we'd like them to describe not only aspects of the memory state, but also aspects of the filesystem state. Linkers already do something a bit this, namely static linking: objects that might not be on the destination filesystem can be bundled into the file, albeit losing their identity as files. Thinking of binaries as “partial initial program states” in this way, there's a clear relationship between ELF executables, Docker images and other collections of memory(-mappable) objects.

Our “embedded ELF file” requirement would be trivial if segments were nameable as files in the first place. This is of course a design feature of Multics, as was dynamic linking. Although the latter has now been picked up by Unix, files and segments have still not been fully re-unified. Maybe it's time? Dynamic linking is one of two key mechanisms, along with filesystems in userspace (whether this or this) that allow us to prototype extensions to Unix without pushing our changes into the kernel. There are usually performance, security or correctness reasons why such extensions are better off ultimately residing at least partly in the kernel. But making a playground entirely in user space is valuable for research and experimentation, and is something I'm working up to. The “embed part of a filesystem inside your binary” idea is one example of the sort of feature that we want to be easily prototyped in such a system. The core features we need—serving the content, and then binding it into the process's namespace”are already provided by the OS, so it's a matter of joining them together.

[/devel] permanent link contact

Mon, 30 Jan 2017

Debugging with the natives, part 2

Some aeons ago, I wrote a piece giving a rough outline of how a native debugger like gdb works, and promised a follow-up that looks at the same pile of techniques in a more principled way. I can't really excuse the delay, but if anyone's still listening, here goes.

Source-level debugging of native code is supported on (I make it) four different different levels in a complete hardware–software stack.

We can split these four into two pairs. The first two implement machine-level primitives and their “upward mapping” (virtualisation) into the operating system's process abstraction. The second two implement the source-level view that the programmer usually prefers, again mapping upwards from binary- to source-level.

Let's take the machine-level pair to begin with. From the operating system's point of view, all debugging support is designed around the following principles. (Perhaps I shouldn't say “designed” since in reality they “grew”— and only became principles later.)

Some surprisingly strong properties result from this design. Firstly, debugging can be done from a remote process, perhaps on a separate machine from the debuggee, perhaps even a machine of a different architecture. Secondly, debugging can be done post-mortem. Thirdly, the same infrastructure works for many source languages—albeit trivially so far, since we've only seen how to get an assembly-level view. There are some contrasts here with most language virtual machines (think JVM): these implement debugging using in-VM debug servers. These can work across the network, but don't support post-mortem debugging, and typically bake in concepts from source languages.

That's enough about the machine-level part. To go from machine- or assembly-level debugging to source-level debugging, we need help from the compiler. This is designed around the following principles.

Let's do a more concrete run-through of how it works. So far I've been fairly generic, but let's fix on GNU/Linux as our modern Unix—though all ELF-based systems are pretty similar—and a familiar architecture (x86-64) and specific metadata format (DWARF).

When I compile a program with -g it means “please generate metadata”. First, let's try without.

$ cc -o hello hello.c
$ readelf -S hello | grep debug  # no output! no debugging sections

You can still debug this program, at the assembly level, because the OS debugging mechanisms remain available. It's as if the compiler-generated assembly is code that you wrote manually by yourself. You can set breakpoints, watchpoints, single step, and so on.

$ gdb -q --args ./hello
Reading symbols from ./hello...(no debugging symbols found)...done.
(gdb) break main
Breakpoint 1 at 0x40052a
(gdb) run
Starting program: /tmp/hello 

Breakpoint 1, 0x000000000040052a in main ()
(gdb) disas
Dump of assembler code for function main:
   0x0000000000400526 <+0>:     push   %rbp
   0x0000000000400527 <+1>:     mov    %rsp,%rbp
=> 0x000000000040052a <+4>:     mov    $0x4005c4,%edi
   0x000000000040052f <+9>:     callq  0x400400 <puts@plt>
   0x0000000000400534 <+14>:    mov    $0x0,%eax
   0x0000000000400539 <+19>:    pop    %rbp
   0x000000000040053a <+20>:    retq 

Now let's compile with debug information.

$ cc -g -o hello hello.c
$ readelf -S hello | grep debug
  [28] .debug_aranges    PROGBITS         0000000000000000  00001081
  [29] .debug_info       PROGBITS         0000000000000000  000010b1
  [30] .debug_abbrev     PROGBITS         0000000000000000  00001142
  [31] .debug_line       PROGBITS         0000000000000000  00001186
  [32] .debug_frame      PROGBITS         0000000000000000  000011c8
  [33] .debug_str        PROGBITS         0000000000000000  00001210

What's in these debug sections? There are three main kinds of information. Firstly, there are files and line numbers (.debug_line). These encode a mapping from object code addresses to source coordinates (file, line, column). You can dump it fairly readably, as follows.

$ readelf -wL hello 
Decoded dump of debug contents of section .debug_line:

CU: hello.c:
File name                            Line number    Starting address
hello.c                                        4            0x400526
hello.c                                        5            0x40052a
hello.c                                        6            0x400534
hello.c                                        7            0x400539

Secondly, there is frame information (often this comes out in a section called .eh_frame so I cheated a bit above; to get exactly the above with a current gcc, you should add the -fno-dwarf2-cfi-asm switch). This tells the debugger how to walk the stack. What does walking the stack mean? Roughly it means getting a sequence of stack pointers paired with their program counter values, for each call site that is active on the callchain. The old stack pointer and program counter are always saved somewhere, otherwise we couldn't return from the call. To walk the stack we start from the current “live” register file given to use by ptrace(), which holds the “end” stack pointer and program counter. The DWARF then describes how to “rewind” these register values, and/or any other registers whose job is to record the callchain (rbp on x86-64; other callee-saves are often included too) back to their state at the previous call site in the chain. The description of this unwinding is logically a table, which you can see below for the main function. The cells are expressions describing how to compute the caller's value for a register, here rbp value (frame pointer) and also the caller's program counter, i.e. the return address (given as ra). The computations are both factored into two. Firstly we calculate a “canonical frame address” from the current frame's register values (see the CFA column): it's a fixed offset from rsp and rbp, and is actually a fixed address on the stack, but the expression changes from instruction to instruction as the stack pointer gets adjusted. Secondly we obtain the saved values we want by reading from the stack at fixed offsets from that address (c-8 means 8 bytes down). This factoring helps compactness, because the CFA-relative offsets don't change when the stack pointer moves; only the CFA column needs to describe that. However, although “stored at some offset from the CFA” covers a lot of cases, sometimes more complex computations are required, which usually appear as DWARF bytecode expressions.

$ readelf -wF hello
00000088 000000000000002c 0000001c FDE cie=00000070 pc=0000000000400526..000000000040053b
   LOC           CFA      rbp   ra    
0000000000400526 rsp+8    u     c-8 
0000000000400527 rsp+16   c-16  c-8 
000000000040052a rbp+16   c-16  c-8 
000000000040053a rsp+8    c-16  c-8 

The .debug_info section is the biggie. It describes the structural detail of the source program along both source and binary dimensions. It has a list of source files, but also a list of compilation units. The latter is where most of the structure is. It describes functions/methods, data types, and all the language-implementation decisions that the compiler took when generating binary code: how data types are laid out, which registers or stack slots hold each local variable over its lifetime, and so on. Although not shown much in the simple case shown below, addresses of program variables are described in a Turing-powerful stack machine language which is essentially a bytecode; the DW_OP_call_frame_cfa below is one operation, which simply says “push the address of the frame base, as recorded by the frame info”. The tree-like structure of the information also describes detailed static structure of code, including function inlining, the in-memory locations corresponding to particular lexical blocks in the code, and so on. (It's worth asking whether DWARF info it might usefully bundle the source code itself. I've never seen this done, but it would make a lot of sense to me.)

$ readelf -wi hello
Contents of the .debug_info section:

  Compilation Unit @ offset 0x0:
   Length:        0x8d (32-bit)
   Version:       4
   Abbrev Offset: 0x0
   Pointer Size:  8
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <c>   DW_AT_producer    : (indirect string, offset: 0x2f): GNU C 4.9.2 -mtune=generic -march=x86-64 -g -fno-dwarf2-cfi-asm -fstack-protector-strong
    <10>   DW_AT_language    : 1        (ANSI C)
    <11>   DW_AT_name        : (indirect string, offset: 0x88): hello.c
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0xb5): /tmp
    <19>   DW_AT_low_pc      : 0x400526
    <21>   DW_AT_high_pc     : 0x15
    <29>   DW_AT_stmt_list   : 0x0
 <1><2d>: Abbrev Number: 2 (DW_TAG_base_type)
    <2e>   DW_AT_byte_size   : 8
    <2f>   DW_AT_encoding    : 7        (unsigned)
    <30>   DW_AT_name        : (indirect string, offset: 0x0): long unsigned int
 <1><34>: Abbrev Number: 2 (DW_TAG_base_type)
    <35>   DW_AT_byte_size   : 1
    <36>   DW_AT_encoding    : 8        (unsigned char)
    <37>   DW_AT_name        : (indirect string, offset: 0xa2): unsigned char
 <1><3b>: Abbrev Number: 2 (DW_TAG_base_type)
    <3c>   DW_AT_byte_size   : 2
    <3d>   DW_AT_encoding    : 7        (unsigned)
    <3e>   DW_AT_name        : (indirect string, offset: 0x12): short unsigned int
 <1><42>: Abbrev Number: 2 (DW_TAG_base_type)
    <43>   DW_AT_byte_size   : 4
    <44>   DW_AT_encoding    : 7        (unsigned)
    <45>   DW_AT_name        : (indirect string, offset: 0x5): unsigned int
 <1><49>: Abbrev Number: 2 (DW_TAG_base_type)
    <4a>   DW_AT_byte_size   : 1
    <4b>   DW_AT_encoding    : 6        (signed char)
    <4c>   DW_AT_name        : (indirect string, offset: 0xa4): signed char
 <1><50>: Abbrev Number: 2 (DW_TAG_base_type)
    <51>   DW_AT_byte_size   : 2
    <52>   DW_AT_encoding    : 5        (signed)
    <53>   DW_AT_name        : (indirect string, offset: 0x25): short int
 <1><57>: Abbrev Number: 3 (DW_TAG_base_type)
    <58>   DW_AT_byte_size   : 4
    <59>   DW_AT_encoding    : 5        (signed)
    <5a>   DW_AT_name        : int
 <1><5e>: Abbrev Number: 2 (DW_TAG_base_type)
    <5f>   DW_AT_byte_size   : 8
    <60>   DW_AT_encoding    : 5        (signed)
    <61>   DW_AT_name        : (indirect string, offset: 0xb0): long int
 <1><65>: Abbrev Number: 2 (DW_TAG_base_type)
    <66>   DW_AT_byte_size   : 8
    <67>   DW_AT_encoding    : 7        (unsigned)
    <68>   DW_AT_name        : (indirect string, offset: 0xb9): sizetype
 <1><6c>: Abbrev Number: 2 (DW_TAG_base_type)
    <6d>   DW_AT_byte_size   : 1
    <6e>   DW_AT_encoding    : 6        (signed char)
    <6f>   DW_AT_name        : (indirect string, offset: 0xab): char
 <1><73>: Abbrev Number: 4 (DW_TAG_subprogram)
    <74>   DW_AT_external    : 1
    <74>   DW_AT_name        : (indirect string, offset: 0xc2): main
    <78>   DW_AT_decl_file   : 1
    <79>   DW_AT_decl_line   : 3
    <7a>   DW_AT_prototyped  : 1
    <7a>   DW_AT_type        : <0x57>
    <7e>   DW_AT_low_pc      : 0x400526
    <86>   DW_AT_high_pc     : 0x15
    <8e>   DW_AT_frame_base  : 1 byte block: 9c         (DW_OP_call_frame_cfa)
    <90>   DW_AT_GNU_all_tail_call_sites: 1
 <1><90>: Abbrev Number: 0

That's it for the tour. Let me finish with some reflections on what's good and bad about this way of doing things.

Descriptive debugging: the good

Why do debugging this convoluted, long-winded way, with all this metadata, instead of the apparently simpler VM way of doing things? In VMs, debug servers are integrated into the runtime, and offer a fixed, high-level command interface. This allows the VM's compile-time implementation decisions to stay hidden. There is no need to tell the debugger how storage is laid out, where local variables live, and so on, because the command interface is pitched at a higher level of abstraction; those details remain internal to the VM. This is convenient for the VM implementer, since generation of debugging information is onerous to implement. But it also requires cooperation from the debuggee, couples debuggers to a fixed command language or wire protocol, and presents strictly less information to the developer. While VM debuggers are designed around the abstraction boundaries of the source languages, metadata-based debugging actively enables descending through these boundaries. It is sometimes very useful for debugging tools to expose implementation details in this way. The most obvious case is when faced with a compiler or VM bug; the user would like to “shift down” to the lower level to inspect the assembly code or VM state. At other times, there are performance bugs that the developer has a hunch are about cache or paging effects; being able to see the raw addresses and raw memory contents can help here, even when the program is running on a VM.

Being highly descriptive, debugging metadata documents a large number of implementation decisions taken by the compiler, so is useful not only to debuggers but also to profilers, language runtimes (C++ exception handling is usually built on DWARF frame information), other dynamic analysis tools such as Valgrind family, and so on.

Debugging optimised code (without deoptimising)

Debugging metadata must describe optimised code. By contrast, VM debug servers typically arrange that debug-server operations only need to deal with unoptimised stack frames and at most simply-generated code (e.g. from a template-based JIT). Confusingly, even the “full-speed debugging” feature of HotSpot uses dynamic deoptimisation to get back to unoptimised code—the earlier approach was to run the whole program under the interpreter whenever you wanted a debuggable execution. In general, a debuggable VM instance must either refrain from optimisation, or know how to dynamically undo that optimisation when a debugger is attached. So, dynamic deoptimisation is not exactly “full speed”—unlike with native debuggers, execution still slows down significantly when a debugger is attached. Having the VM implement debug operations only over unoptimised code is a restriction that helps make the debug server simple, at some cost in debug-time performance.

The flip side is that VM debugging is pretty good at precisely maintaining the source-level abstraction (modulo VM bugs), without complicating the task of implementing optimisations. Meanwhile, in Unix-land, the debugging experience remains best-effort and only as good as the compiler-generated metadata, which is sometimes wrong or incomplete following complex transformations. When optimisation and debuggability are in tension, debuggability usually takes the hit, so a smooth debugging experience still sometimes relies on a separate unoptimised “debug build”. Tail call optimisations are a classic debugging-hindering optimisation, since they rely on eliding stack frames, meaning the debugger cannot know how many recursive calls are logically on the (source-level) stack. Instruction scheduling is another: the order of operations in the executed code need not match source program order, and this can make for debug-time confusion.

The control problem

Some other properties are incidental but tend to be true of current Unix-style debugging. Debugger support for exercising the target language's control features (exceptions, threads, allocations, ...) is uneven, because this can't be described purely as metadata; there needs to be some interaction with the host runtime. DWARF and similar debugging information is good at the “statics” of describing how to decode the program state, but not good at describing the protocols of interaction with a language runtime, necessar for performing operations such as spawning a thread, allocating an object or throwing an exception. These tend to be difficult unless these happen to be cleanly exposed as entry points in the language runtime. In practice debuggers usually achieve these things by having magic knowledge of particular implementations.

At least one semi-portable interface has emerged with the aim of encapsulating run-time control operations for debuggers' benefit. I'm thinking of libthread_db, best described by Joe Damato's excellent article. Unfortunately it's an abomination, because it violates the principle that implementation details are described by architecture-independent metadata. An odd but cleaner and more consistent alternative would be to bundle snippets of DWARF bytecode for doing these runtime interactions—perhaps in the debug information of a language runtime, either simply calling into the runtime (for cleanly abstracted operations) or doing something more complex. But that is only a technical possibility; there are no proposals or working demos of that as far as I'm aware (maybe I'll make one). This might sound wacky, but if you know about the early history of Java, in Oak and the Green Project, you'll see a certain uncanny similarity in these ideas.

Levels of abstraction

Debugging at multiple levels of abstraction is a neat facility of Unix-style debugging, but is also a difficult power to control. It can be useful to switch down to the assembly-level view, or to switch languages, but this capability doesn't generalise to the case where many abstraction layers are built up within the same language (think C++). The debugger will let you “continue to the next source line”, but it doesn't know how to keep you at the same abstraction level. If the next source line is deep inside a library implementing something fairly basic like a smart pointer, it will skip only as far as that line, whereas you probably wanted to stay roughly at the same level of abstraction, or perhaps within the same codebase. Things get particularly bad when there is a lot of inlining (again with C++). The traditional “step over” and “step into” are hints at this need, but are too crude.

Doing better is currently beyond the debugger's ken, but this problem could be solved: perhaps by bringing in the knowledge of libraries and source file trees that the debugger already has, or perhaps most simply by allowing programmers to manually mark out the boundaries between layers. This could be a simple partitioning over source files and binary objects, or could be something more complex, perhaps sensitive to calling context or argument values (consider the case of the same library used from two places in the same application). “Next”-style operations could then be defined in terms of these layers. I'd love to see this, although the details would take a lot of working out.

To be continued?

There's plenty of research to be done on more debuggable language implementations. This research area doesn't seem to get the attention it deserves. One problem is that debugging is usually an afterthought for researchers, even though it is essential for programmers. Another problem is that not many PL researchers have sight of the design space; they're familiar with either the Unix-style approach or the VM-style one. I hope that in the future we can figure out how to get the best of both worlds.

[/devel] permanent link contact

Wed, 21 Sep 2016

Project suggestion: run-time type information in the FreeBSD kernel

It's past time I wrote up some Part II project suggestions. This post is about a new one. See also one from last year about a debuggable implementation of OCaml, also my suggestions from 2014, another from 2014 and other standing suggestions.

My liballocs project is a library and toolchain extension that allows tracking of run-time allocation types in large C programs, precisely and efficiently. On top of this, the libcrunch system uses this information to do run-time type checking, such as checking that a pointer cast matches the type of the actual pointed-to object. The idea is to make C coding a lot more easily debuggable, by providing clean failure messages at run time, rather like ClassCastException or ArrayIndexOutOfBoundsException in Java. There are various other applications in debugging and program comprehension, since liballocs effectively adds an oracle capable of answering the question “what is on the end of this pointer?”.

Both of these systems have been developed for user-space code but could usefully be applied in the kernel. The toolchain extensions, which gather type information from binaries and analyse allocation sites in C source code, should work with kernel code mostly unchanged. However, the liballocs runtime is written against user-level interfaces and cannot be used as-is.

There are two obvious approaches to creating a liballocs runtime suitable for the kernel.

One is simply to modify the existing liballocs runtime to support kernel-side interfaces. This should yield a liballocs.a library that can be linked into the kernel, adding code to observe the kernel's allocations and maintain type metadata on them as it executes. The main differences will be how the runtime allocates memory, and how it locates and loads type information. Likely, the latter will be trivial (the type information can simply be linked in to the kernel), except perhaps for loadable modules (they must be created with care, so that they contain any additional type information). The former is more complex, since the current runtime makes extensive use of Linux's unreserved virtual memory (via MAP_NORESERVE) which is not available as such in the FreeBSD kernel, although equivalent alternatives are likely possible with some hacking.

The other option is to use DTrace to write a non-invasive “monitor” process, started early during boot, that captures equivalent information to the liballocs runtime. One could then implement a client library which mimics the liballocs API and supports the same kinds of query. This approach is likely simpler to get started, and allows for a completely external, non-invasive client that can be used without recompiling the kernel. However, it is less reliable because it is likely to miss certain kernel allocation events, such as those happening early during kernel initialisation, or those that DTrace drops because of buffer contention. These will lead to incomplete type information. It is also unlikely to be able to answer queries coming from the kernel itself, so is useful for debugging and comprehension tasks (queried by a user-space program) but not for run-time type checking.

Both versions will require careful understanding of the FreeBSD kernel's allocation routines, for which some familiarisation time must be allowed. The DTrace option would be a good vehicle for exploring this, after which an executive decision could be made about whether to switch to the full in-kernel option.

[/research] permanent link contact

Thu, 25 Feb 2016

Debugging with the natives, part 1

By popular request, I'm going to run a couple of articles explaining how native Unix debuggers like gdb or lldb work. In this one, we'll take a quick cross-section of what's going on. I'll say “gdb” but I could be talking about any similar debugger.

Starting things up

When you start a program in a debugger like gdb, it does a variation of the usual fork()exec() trick, in which it sets itself up as a tracer of the child process before it exec()s the program you want to run. Alternatively, the same API can attach to a process already running. Either way, the effect is that any signals received by the “tracee” process will be delivered first to the debugger, and any system calls made by the tracee will also generate a signal. These signals can be inspected, discarded or modified as the debugger wishes, using the ptrace() API. In effect, the process (tracee) and the debugger (tracer) execute by mutual hand-off, suspending and resuming each other.

Walking the stack

Using the same API, gdb can access the raw memory image of the debugged program, and its register file. (Note that when gdb is active, the program is suspended, so its registers are saved in memory.) It can use this to walk the stack and print a backtrace: the register file gives us the program counter and stack pointer. If we have a nice simple stack which saves frame pointers, it simply walks the chain in memory. To print symbolic function names, it can use either the symbol table (if we must) or the compiler-generated debugging information (better; more on that in a moment).

Reading local state

To print a local variable, the debugger uses the same compiler-generated debugging information to look up where that variable is currently located. This lookup is keyed on the program's current program counter, to handle the way that locals get moved around between registers and the stack from instruction to instruction. The result of the lookup might be a memory location, a register number, or even, sometimes, something composite like “first 8 bytes in register X, rest in memory location Y”.

Setting breakpoints

To set a breakpoint on a particular source line, gdb uses a different part of the debugging information which encodes a mapping between binary locations (program counter values) and source file/line/column coordinates. A small subset of program counter values are specially identified as “beginning of statement” instructions; these are normally the ones the user wants to break on. Sometimes, hardware debug registers can be used to implement breakpoints (up to some maximum); the CPU generates a trap on reaching the program counter value stored in the register. More classically, the “soft” method is for the debugger to overwrite the instructions with trap instructions, saving the old instruction. This trap will cause a signal to be generated when the program executes it.

Resuming from breakpoints

To resume from a software breakpoint, a bit of juggling is required. Recall that we overwrote the original instruction. We can replace it, then ask the OS to single-step the program (using the hardware's single-step mode), then re-set the trap. This is racy in multithreaded programs when the other threads aren't stopped. So instead a good debugger will emulate the instruction itself! This means using an internal interpreter for the target instruction set, backed by the memory and register file access that it already has via ptrace().

Conditional breakpoints and expression evaluation

If we have a conditional breakpoint, we also need to be able to evaluate expressions when we take the trap (and silently resume if the expression is false). For this, the debugger has a parser and interpreter for each source language it supports. These interpreters are very different from an ordinary interpreter: the expression's environment comes from the binary contents of the debugged program's image, rather than being a structure the interpreter itself maintains. (The interpreter keeps a small amount of local state, such as for subexpression results, but it's mostly in the debuggee.)

Calling functions

To evaluate expressions containing function calls, we need to be able to run code in the debuggee. To do so, we craft a special frame on the debuggee's stack, a safe distance away from its actual top-of-stack. Having evaluated the arguments as usual, we put them in the relevant registers (saving what was there), tweak the saved stack pointer to point to this crafted state, set the instruction pointer to the callee, and set the debuggee going again. The function runs as normal, and on return it uses an on-stack return address that the debugger supplied. This points at a breakpoint-like instruction that raises a signal, returning control to the debugger. This then cleans up the stack as if nothing happened, and resets the registers accordingly. (Disclaimer: I haven't actually dug through gdb's code here, so I might have one or two things subtly wrong. At least, the above is how I'd implement it. Let me know if you know better.)

It goes on...

There's lots more, but if you understand the above, you've probably got the gist.

All that seems like a big bag of tricks. What are the principles at work in this design? Although the system has “grown” rather than being “designed”, and has a generous share of quirks and unstandardised corners, there are some surprisingly strong principles lurking in it. I've written a little about these in my paper at Onward! last year. Next time, I'll cover briefly some of the same observations, and a few others, hopefully thereby re-explaining how debuggers work, a bit more systematically .

[/devel] permanent link contact

Wed, 30 Sep 2015

Project suggestion: an observable OCaml, using liballocs

It's the season to suggest student projects, and this post is about one which follows up a series of blog posts (1, 2, 3, 4). Feel free to see also my suggestions from last year, another from last year and other standing suggestions

This post is about a suggested project which will produce a working implementation of the OCaml programming language, using an existing front-end but a very different back-end.

Unlike the mainstream implementation, this implementation will be optimized for debuggability, or more generally “observability”. The compilation strategy will be to translate to C, hence obtaining the debuggability (and, to some as-yet-unknown extent, the performance) of the C compilation toolchain.

This means at least three key differences. Firstly, all locally-bound names will map to C local variables. Secondly, OCaml's lexical closures will be implemented in a highly compatible way, perhaps using the technique of Breuel (which is already implemented in the GNU C compiler, but with stack allocation, whose lifetime semantics are not appropriate for OCaml). Thirdly, all allocations will have associated type metadata at run time, using liballocs This will allow parametric polymorphism to be “seen through” by a debugger—so it can tell you, for example, that some stack frame of a function whose type is 'a -> 'a is actually activated with 'a as int, say. (This may or may not be the right formulation of “seeing through”—stronger and weaker versions are possible; ask me!)

Key ingredients are the following:

Optional “extension”-style challenges might include the following:

Evaluation of the system is mostly via performance—the goal would be to get within a reasonable factor of the OCaml native-code compiler. We can also do some experiments to measure observability of our system. One simple experiment might interrupt both versions of a program at randomly chosen call sites and count how much of the local state we could recover (number of frames in the backtrace, number of locals in those frames; perhaps even doing a diff on their values). Call sites are a good choice because all but one frame of the stack is always stopped at one, and instrumenting the native code so that we can do lock-step breakpointing, by counting call invocations, should not be too difficult (though it might mean disabling inlining, which does affect what is being measured). Usually there'll be no contest, since the original OCaml native compiler doesn't let you observe locally bound values at all. There is, however, a development branch which does, and comparing with the bytecode debugger (ocamldebug) is also a good bet.

[/research] permanent link contact

Wed, 16 Sep 2015

Partially evaluating a bytecode interpreter using C++ templates

I should start this by saying that I'm definitely not a C++ metaprogramming genius. People have done outrageously clever and diverting things with C++ templates, and I am not about to outdo anyone. Nevertheless, this might be interesting.

I was facing the task of taking code in a simple but obscure stack machine language—the stack machine from DWARF, if you couldn't guess—and turning it into vaguely efficient machine code. My thinking was spurred on by systems like Truffle, or PyPy's tracing JIT, which produce compiled code from only source code and an interpreter for the source language. Unlike those systems, I didn't need dynamic compilation—offline is fine. But also, I wanted a system with easy support for C and C++ intrinsics, zero runtime baggage, and something I could prototype in a day or so. It turned out that writing the interpreter in C++ templates, and using the C++ compiler as a partial evaluator, was reasonably successful at all these. Here's how it works.

To start with, I simplified right down from DWARF to an analogous but much simpler stack machine where all operands are passed on the stack. We model instructions using simply an enumeration. This is not essential, or even desirable, except for exposition.

/* instruction -- a simple enumeration */
enum instr
    PLUS, /* add top two stack slots and push result */
    BZ,   /* pop val and dest off stack; if val is zero, branch to dest */
    PEEK, /* pop an address off the stack and push its contents */
    DUP2  /* clone the top two elements of the stack */

[I've edited this slightly from my initial version of this post, to add the DUP2 instruction. This allows me to include a slightly more meaningful example program—thanks to Ross Bencina for prompting that.]

The four instructions are representative of the classes of operation I care about: arithmetic, control, C intrinsic and stack manipulation. They're not a very complete machine, but they suffice to illustrate things. To witness the power of this somewhat-operational stack machine, a simple program which combines all four opcodes is as follows.

typedef program< PLUS, program< DUP2, program< BZ, program< PEEK, void > > > > prog;

This program add two numbers taken as input, and if they sum to something non-zero, it interprets that as an address and peeks it, returning what's stored there. Otherwise it returns zero. Note that since we only have a computed branch instruction, not a direct branch, the branch target exists as data in the initial stack. From top to bottom, our initial stack must be [arg1, arg2, 4], where arg1 and arg2 are the two input values, 4 is the program counter to branch to (the end of the program). We don't need to store the literal zero in this case (I'll let you figure out why).

So, clearly, a program is a sequence of instructions. We model it in the C++ compiler as a template-level sequence, with the opcode as one template argument and the program “tail”, i.e. the following instructions, as another template argument.

/* program -- a sequence of opcodes */
template <unsigned Op, class Next>
struct program
    static const long val = Op;
    typedef Next tail;

In effect, this gives us a “linked list” structure but at the template level. We can write some simple helper functions over this kind of structure, using a functional style in which template specialization is how we do case-splitting and template instantiation is how we do recursion.

template <class Seq> 
struct sequence_length
    static const long value = 1 + sequence_length<Seq::tail>::value;

template <>
struct sequence_length<void>
    static const long value = 0;

/* Utility for getting the tail of a type-level sequence. */
template <typename Seq>
struct tail_of
    typedef typename Seq::tail type;
template <>
struct tail_of<void>
    typedef void type;

/* Utility for getting the nth of a sequence, then the rest;
 * -- it's like "let s = drop n Seq in (hd s, tl s)" */
template <unsigned n, class Seq> // n > 0 case
struct nth
    static const long val = nth<n-1, typename tail_of<Seq>::type>::val;
    typedef typename tail_of< nth<n-1, typename tail_of<Seq>::type> >::type tail;
/* termination case: n == 0 */
template <class Seq>
struct nth<0, Seq>
    static const long val = Seq::val;
    typedef typename tail_of<Seq>::type tail;

The main use of sequences is to model the program itself, a sequence of instructions. We also use them to model the program counter, encoding it simply as the “tail” of the program, i.e. the subsequence of instructions from the current position to the end of the program. (This won't stop us from jumping backwards, because we'll always pass around the whole program too. I'll come on to jumps a bit later.)

What's our program state? It's a program counter and an operand stack. Since, in general, our stack will not be statically known, we have to model it at the value level, passing it around as “real” arguments at run time rather than template arguments. (I did create a version where a parallel family of template specializations could be used to encode statically known stacks. That's neat because it you can get the C++ compiler to partially evaluate the stack-machine program for known input. But I'm skipping that extra complexity here.)

To actually represent the stack at run time we'll use a plain old array of long. I'll show you that in a second. Before I do, let's start to piece together what our evaluator looks like. It's a template that takes arguments standing for the whole program and the program tail, i.e. program counter. The whole program is always statically known, so is a template argument. There are finitely many possible values of the program counter, so we can make the compiler enumerate them; this means the program tail can be a template argument too. (This works nicely with how we do branches; I'll come on to that.)

/* States in a computation */
template <
    /* Program */ typename Program,
    /* ProgramTail */ typename ProgramTail    /* i.e. a suffix of the Program, 
                                                 beginning with the next instruction to run. */
struct state
    static long eval(/* some args... */)
        /* figure out our next state and delegate to it... */

We also want each state to support an eval() method which executes the program's continuation from that state. Most of the time this will work by calculating the next state and delegating to its eval() method. For states that are “end states”, like when we hit the end of the program, we'll instead return something—the value on the top of the stack. We can handle this case using another specialization (which we'll define later).

A simple opcode's eval() method looks something like the following, which implements PLUS. We pass the stack as an argument to the eval() method, in the form of two pointers, bottom and top.

template <
    /* Program */ typename WholeProgram,
    /* ProgramTail */ typename ProgramTail
struct state< WholeProgram, program<PLUS, ProgramTail> >
    static long eval(long *bottom, long *top)
        top[-1] = top[0] + top[-1];
        return state<WholeProgram, ProgramTail >::eval(bottom, top - 1);

Here we destructively update the stack, decrementing the top. Note that the whole interpreter works in continuation-passing style, and the current stack is always passed around explicitly and linearly—all our eval() calls are tail calls. Currently we malloc() the stack eagerly to a large-ish size, and abort on overflow. But linearity means we could even start small and then realloc() it if it overflowed. A neater trick would be to make it an actual stack that is lazily allocated by the OS, using mmap()—this would give us gradual allocation without cluttering our instruction stream with overflow checks or realloc() calls (but I haven't implemented this yet).

That's the basic idea of our evaluator: each opcode can be implemented by a template specialization which gets elaborated differently according to its position in the program. What about branches? We can handle branches by a single specialization for the case where the PC is not known statically. As before, we deal with the not-statically-known value by accepting it as a method (value) argument rather than a template (type) argument. The simplest and fastest way to do this is to code up a jump table. For this to work, we also have to bound the size of programs.


struct run_time_pc {};

template <
    /* Program */ typename WholeProgram
struct state< WholeProgram, run_time_pc >
    static long eval(long *bottom, long *top, long pc)
        switch (pc)
#define CASE(n) \
case n: return state< \
        WholeProgram, \
        program< \
            /* We build the tail starting from the nth, using the definitions in `nth<>'. */ \
            nth<(n), WholeProgram>::val, \
            typename nth<(n), WholeProgram>::tail \
        > \
    >::eval(bottom, top);
            /* ... */
            /* MAX_PROGRAM_SIZE */
                assert(pc &gt;= MAX_PROGRAM_SIZE); 
                warnx("branched out of bounds");

Now, to implement a branch opcode, all we need to do is have it call this run_time_pc specialization's eval(), which takes an extra argument: the destination program counter, passed as a value. Passing it as a value is what allows the jumping instruction to actually decide the destination at run time. At worst, the C++ compiler compiles this switch statement down to a jump table which can jump to the code implementing any of the program's instructions. If this happens, the compiler must enumerate all these destinations up-front in order to build the jump table, hence why we had to bound the program size. At best, though, if we're doing a direct branch, the branch target will be statically known in the caller of this eval() template, allowing the C++ compiler to optimise away the dead switch cases, replacing the jump table with a direct branch. (In our current stack machine, all branch destinations are encoded on the stack, so are potentially data-dependent, meaning we don't hit this direct-branch case. But in the actual DWARF stack machine, we can encode direct branch targets into the instruction, so we will hit this case.)

Since each entry in this jump table jumps to a specific part of the program, we immediately transition back to having a statically known PC value. Our implementation of branches is completely self-contained within this one mechanism.

Actually, make that two mechanisms, because I want to add an alternative. We can avoid hackily bounding the size of the program, if we don't mind rewriting our jump table as a chain of ifelse branches. We can use templates to generate this chain of branches, for any length. To do this, we parameterise our run_time_pc type on the upper bound of what program counter value we might be jumping to. Initially this is the length of the program; recursive elaborations' ifelse-tests whittle this down to zero, the lowest-numbered place we could jump to. In this way, the compiler effectively builds our switch statement (as an if–else chain) for us. As before, if it's a direct branch, all the tests will get elaborated away. If it's not, whether we get a nice jump table depends on how clever the compiler is. For me, sadly g++ seems not to generate a jump table, even at -O3, but perhaps it can be coaxed into it somehow (let me know if you know!).

template <unsigned UpperBound> 
struct run_time_pc {};

template <
    /* Program */ typename WholeProgram,
                  unsigned MaxPc
struct state< WholeProgram, run_time_pc<MaxPc> >
    /* This is fast only if the C++ compiler can turn if--else chains back into jump table. */
    static long eval_if_else(long *bottom, long *top, long pc)
        if (pc == MaxPc) return state<
                /* We build the tail starting from the nth, using the definitions in nth<>. */
                nth<MaxPc, WholeProgram>::val,
                typename nth<MaxPc, WholeProgram>::tail
        >::eval(bottom, top);
        else return state<WholeProgram, run_time_pc<MaxPc - 1> >::eval_if_else(bottom, top, pc);
template <
    /* Program */ typename WholeProgram
struct state< WholeProgram, run_time_pc<0> >
    static long eval_if_else(long *bottom, long *top, long pc)
        if (pc == 0) return state<
                /* We build the tail starting from the nth, using the definitions in nth<>. */
                nth<0, WholeProgram>::val,
                typename nth<0, WholeProgram>::tail
        >::eval(bottom, top);
        else abort();

I promised to show how we deal with the end of the program. We just specialize for the case where the program tail is void, i.e. the empty sequence.

/* This specialization handles "end of program". */
template <
    /* Program */ typename Program,
struct state<Program, void>
    static long eval(long *bottom, long *top)
        if (top <= bottom) abort();
        return top[0];

There are a few more fiddly things to figure out to make it all compile, but I've shown you all the interesting stuff. A working snapshot of the implementation is here. It compiled only with g++, not clang++, the last time I tried; fixes are welcome. It implements everything I've mentioned. In the near future I'll turn it into an actual DWARF compiler, and it will appear in liballocs. You might ask: why? In short, compiled DWARF expressions are hopefully a way to achieve faster introspection on complex memory layouts. The liballocs metadata toolchain already “compiles” much DWARF content down to an efficient in-memory representation. By compiling not only manifest DWARF structure but also computational DWARF expressions, we can efficiently introspect on structures whose layout can only be described precisely via a computation—think variable-length arrays, hash tables, and so on. Eventually I also want to support “compiled frame information”, which I hope will allow faster stack walking.

[/research] permanent link contact

Thu, 16 Jul 2015

ELF introspection, robustly and portably

I often find myself writing code that does introspection at the ELF level. This is for things like looking up symbols, walking the link map, figuring out which shared object a particular memory address belongs to, which memory regions are executable, and so on. These are fairly common things to want to do in ptrace()- or LD_PRELOAD-based tools, or binary interpreter-like (Valgrind-style) ones. This post is about how to do these things robustly from within a process, where “robustly” brings some particular restrictions.

What introspection?

What does our address space look like? To spare myself the pain of drawing diagrams, we can ask Linux to dump a somewhat-pictorial view of a simple cat process by doing cat /proc/self/maps. I have reversed the order of the lines so that higher addresses go higher up, and tweaked the formatting slightly.

          address range           flgs  offset   dev   inode                  pathname or other name   
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

    7fff301fe000-    7fff30200000 r-xp 00000000 00:00 0                  [vdso]

    7fff301a5000-    7fff301c7000 rw-p 00000000 00:00 0                  [stack]

    7f4a574bc000-    7f4a574bd000 rw-p 00000000 00:00 0 
    7f4a574bb000-    7f4a574bc000 rw-p 00023000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a574ba000-    7f4a574bb000 r--p 00022000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57493000-    7f4a574ba000 rw-p 00000000 00:00 0 

    7f4a57298000-    7f4a572bb000 r-xp 00000000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57293000-    7f4a57298000 rw-p 00000000 00:00 0 
    7f4a57291000-    7f4a57293000 rw-p 001bf000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5728d000-    7f4a57291000 r--p 001bb000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5708e000-    7f4a5728d000 ---p 001bc000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56ed2000-    7f4a5708e000 r-xp 00000000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56c09000-    7f4a56ed2000 r--p 00000000 08:05 2750795            /usr/lib/locale/locale-archive
        00838000-        00859000 rw-p 00000000 00:00 0                  [heap]
        0060c000-        0060d000 rw-p 0000c000 08:01 286233             /bin/cat
        0060b000-        0060c000 r--p 0000b000 08:01 286233             /bin/cat
        00400000-        0040c000 r-xp 00000000 08:01 286233             /bin/cat

This picture is a refinement of the usual OS textbook diagram of a Unix address space. As usual, we have the executable's text segment (bottom), read-only and then writable data (next two up), heap (next up), a stack (much higher) and the kernel region (up top). (Actually the kernel region is not shown, except for one small part, the vsyscall page, which is the only user-accessible area.) The biggest difference is that we have not only the executable but some libraries—primarily the C library (since cat doesn't use others), but also the dynamic linker (itself a library) and the mysterious vdso. All these are ELF objects. In general, a large subset of a process's structure is actually defined by the collection of loaded ELF objects and a little per-process structure also defined by ELF. Our problem is now: how can we acquire a view on this ELF-level structure, from within the program itself at run time, efficiently and robustly?

(The above listing also shows a few other flavours of mapping that won't concern us here: a memory-mapped file /usr/lib/locale/locale-archive, and two anonymous mappings, beginning at 0x7f4a57493000 and 0x7f4a57293000, which are almost certainly heap areas—we can have many heaps, not just the one that grows out of the executable's data segment. And of course, although we see only one mapping labelled [stack], in a large multithreaded process there will be many such.)

What's ELF?

The basics of ELF linking were first defined in System V Release 4 in 1988, and eventually adopted by non-commercial Unices starting in the mid-1990s (1995 for Linux, with kernel 1.2; 1998 for FreeBSD, with version 3; NetBSD and OpenBSD in 2001 and 2003 respectively). The “classic” picture of a loadable ELF file (direct from the spec) looks like the following.

Execution View
ELF Header
Program header table
Segment 1
Segment 2
Segment 3
Section header table

But actually, that omits some of the most interesting stuff! Let's try again.

Execution View
ELF Header
Program header table
Dynamic linking metadata
Dynamic symbol table
Dynamic relocation table
Segment 1
Segment 2
Segment 3
Section header table
Symbol table

Dynamic symbols, program headers and the miscellaneous dynamic linking metadata are what we're interested in introspecting on. The POSIX libdl API offers one interface for doing introspection at the ELF level, at least for looking up symbols by name. Non-portable extensions, like SunOS/Solaris's dladdr() and the later-added dladdr1(), together with GNU's dlvsym() and dl_iterate_phdr(), get us a bit further, walking program headers and looking up symbols by address or version. We get further still if we allow ourselves OS-specific features, like Linux's helpful maps file in the /proc filesystem that generated the dump above, together with the auxv file or its associated library call getauxval(). If we're happy to parse the ELF file itself, by doing file I/O, we can get at anything we like.

What's “robustly”?

There are problems with all of the preceding approaches. Portability is lacking from many—only the POSIX APIs and ELF file I/O are widely portable. Robustness is lacking from all of them, in the sense that it may not be safe or reliable to use them from “sensitive contexts” where ELF introspection may be required. The main kind of “sensitive context” is one where making system calls or library calls would cause unwanted reentrancy. For example, if we want access to metadata from instrumentation logic that runs in the middle of a malloc() call, calling out to a library which might itself call malloc() is a bad idea. If we're instrumenting the code paths that make system calls (ask me how!), calling out to the OS might also be a bad idea. These unintended reentrancies can easily cause infinite loops, deadlock or state corruption. There are also security reasons why system calls and library calls are less reliable than working entirely within the process: the filesystem might have changed since the filenames in our maps or link map were recorded, for example.

So, how much ELF introspection can we do without the help of system calls, library calls or non-portable code? The answer surprised me: quite a lot, but not quite everything. There is some low-hanging fruit, then some that we can just about reach (provided a couple of assumptions which, although not strictly portable, probably hold for most platforms), then some that absolutely no portable method can let us get (without resorting to file I/O).

Let's state our problem slightly more precisely. Our address space includes a selection of objects (the executable and all loaded libraries), mapped at particular addresses. We'd first like to enumerate these mappings, giving us a look-up from load addresses to the object's originating binary file (pathname)—more-or-less the structure we saw in the /proc/self/maps listing earlier. Furthermore, each object defines some symbols. We'd like to be able to resolve symbols in each object, as well as any other useful other metadata, like their soname, the libraries they depend on, the locations of support structures like procedure linkage table (PLT), and the relocations performed when they were loaded. This is what we saw in the elaborated ELF file picture. The symbols-and-addresses stuff is the most important; the relocation and miscellaneous information isn't useful as often, but we'd like to have it anyway. Finally, there's some useful metadata that the operating system has passed down about the process, in an ELF-defined structure called the auxiliary vector. It turns out that we'll use this to get at some of the other metadata, but some other parts of it are useful in its own right—they tell us the process ID, page size, and various properties of the hardware.

The link map

To enumerate the mapped objects, we need access to a structure called the link map. The portable interface for walking the link map is the mysterious r_debug protocol. This is surprisingly murky in its origins, but it probably comes from SunOS; it was definitely included in System V release 4 (the only solid reference I know is the relevant Programmer's Guide). It's actually not specified in most ABIs, although some seem to do so. The r_debug protocol defines a structure containing pointers to the link map. From an executable, it's easy to find this structure: we usually have a DT_DEBUG entry in PT_DYNAMIC, and the dynamic linker updates to points to a r_debug structure it allocates. This contains a pair of pointers to the link map, represented as a doubly-linked list.

What about from a shared library? Getting a pointer to the r_debug is trickier here, because shared libraries don't usually have a DT_DEBUG entry. This is a chicken-and-egg problem: we need the executable's r_debug to get to other objects, but if we're not starting in the executable, how do we get there? So here comes our first somewhat-non-portable assumption: that a global symbol _r_debug is available, and marks the r_debug structure in the dynamic linker. This symbol is commonly present, but I don't believe it's standard—POSIX generally specifies programmer-facing APIs, not linker-level symbols. (As we'll see shortly, another method for getting at it is available, but it is also non-portable and imposes some other requirements. Anyway, by combining these two options, we have a pretty good chance that we can find the link map. So, let's continue on the assumption that we have the link map.)


Once we have the link map, we'd like to do symbol lookups in each object. For this, we need (at least) the dynamic symbol tables, which we can find via the PT_DYNAMIC segments of each object. (We probably also want the symbol hash tables; the method is similar.) The link map helpfully keeps a pointer to this specific segment (rather than to the overall program header table). The segment's content consists of key-value pairs, and it's the pair whose key is DT_SYMTAB that points to the dynamic symbol table. (On most ABIs, a short cut to get a pointer to the containing object's PT_DYNAMIC segment is to make a link-time reference to the symbol _DYNAMIC. But this only lets us get the containing object's PT_DYNAMIC, whereas walking the link map gets us all of them.)

One caveat about symbol lookups is that only dynamic symbols are looked up. The other symbols, if there are any (not needed for dynamic linking, but useful for debugging), live in a different table, .symtab, which needn't be mapped at run time. Depending on how your object was linked, perhaps most symbols became dynamic symbols (--export-dynamic) or perhaps only a minimal selection did. If you've ever tried to look up stat() in the GNU C library, using dlsym(), and been surprised not to find it, this is why. The stat() function is one of a few that are always linked statically, even when the rest of the C library is linked dynamically. In general, not all code or data in our dynamic objects is actually described by dynamic symbols. I'll return to this shortly.

The auxiliary vector

The link map gave us a list of loaded objects and their base addresses, but it doesn't tell us where each individual segment is mapped. This is where we begin our voyage into the not-really-portable. In the case of the executable, we can get the program headers via another useful metadata structure: the auxiliary vector. This is created by the operating system to supply various information of use to startup code (typically the bowels of the C library) and the dynamic linker (which, if one is used, also mutates the metadata to make it look as though the executable was loaded directly). (If we failed to get the link map using the previous _r_debug symbol method, we can also use the auxiliary vector to get to the executable's DT_DEBUG, and hence to the link map.) Here's a very nice article all about auxiliary vectors.

The auxiliary vector lives high on the initial thread's stack, a fixed offset above the stack pointer when the program's first instruction receives control. If we could walk the main thread's stack right to the top, we could find the auxiliary vector easily. However, nowadays, since we lack frame pointers, that means making a library call to something like libunwind, and that might allocate memory. Even then, it's not guaranteed that we can walk the stack all the way to the top, since unwind information might be patchy up there. So we'll need a different approach.

I devised a nasty but ultimately pleasing hack for this. When we say “auxiliary vector” we really mean specifically a list of key-value pairs containing the useful metadata I mentioned. But there's also a bunch of other things up there, in a large contiguous structure (see the helpful article for a diagram): the environment strings (a blob of characters), argument strings (ditto), and vectors of pointers into these blobs (our initial environ and argv). Instead of walking the stack, can we get hold of a pointer into one of these blobs, and then parse the surrounding memory to find the start of the key-value pairs? It's very likely that we can, since we can expect to find a handy global variable, environ, that points into them.

Now, it's important to note that this isn't completely robust. The auxiliary vector only records the initial environment. A program can modify the current environment strings to its heart's content by using C library calls like putenv() and getenv(). These will mutate and, if necessary, reallocate the vector pointed at by environ. However, unless a program deletes its entire initial environment, and assuming the C library doesn't eagerly copy things out of it, that vector should always contain one or more pointers into the initial environment blob. So, we want to walk environ and look for such a pointer.

The next problem: how will we recognise such a pointer when we see it? For this, we need to fix some more very-likely-true assumptions. Firstly, we assume either that the initial stack is the highest user mapping in the process, or that if something is higher, it's a shared object (perhaps Linux's vdso) which we can identify in the link map. (Put differently: what's not allowed is for there to be a memory-mapped file above the initial user stack, other than a shared object. This doesn't seem too restrictive; we could even imagine an ABI-level guarantee from the kernel that it will never create such mappings.) This assumption gives us an upper bound to compare our environment pointers against. What about a lower bound? For that, we assume that the caller can supply us one, in the form of a pointer higher than any other stack-allocated string in environ. The address of a local variable in main() could be one such pointer. In fact, any address on the initial stack will probably be good enough, since putting a stack-allocated string in your environment would be a very quirky thing to do. Anyway, we suppose that our caller, the client code which wants to get hold of the auxiliary vector, can give us a such a pointer.

Now we're ready to go. We scan *environ for pointers to strings within these bounds. Then we navigate downwards in memory to find the end of the auxiliary vector, then keep going to find its beginning, after which we'll see the terminator of the argument or environment pointer vector.

Once we have the auxiliary vector, we have the executable's program headers, from which it's easy to get at most other things we need, again using the link map to access other objects.

Program headers

One challenge remains for us: getting at shared objects' program headers. The problem here is that these headers needn't, in principle, be found anywhere in memory. (The ELF format allows a file to require that its program headers are mapped, using PT_PHDR, but it's not required and its use is uncommon.) Typically a dynamic linker will always keep each object's program headers in memory somewhere, but it's not obliged to do so. Even if it does, that memory might be anywhere. The GNU dynamic linker sometimes stores just a malloc'd copy of the data, perhaps originally read through a file descriptor rather than a memory mapping. Either way, the pointer it maintains to the header data lives in private state (in extra non-public fields in the link map structure); there's no portable access to it. So, to get shared objects' program headers there's no avoiding the use of some platform-specific (and loader-specific) code to fish it out.

The major limitation of doing without program headers is that we don't have a complete map of the address space. The link map lets us enumerate objects and their start addresses, and we can then use the symbol tables to do a “forward lookup” from symbols to addresses. We can invert the symbol tables to give us a partial backward lookup from addresses to symbols. But we can't do quite as complete a backward lookup as the dladdr() function can. If you give dladdr() an address that falls within an object's mappings but isn't covered by an exported symbol, it can still tell you which object it's in. Only the program headers contain the information necessary for this. Another thing we can't figure out is which memory access permissions a segment was mapped with—again, that means looking at the relevant program header.

Maximising ELF introspectability

Armed with these observations, we could imagine dynamically rewriting the binaries on our system slightly to optimise for ELF introspectability. Firstly we would insert a PT_PHDR, and define a symbol (maybe _PHDR), roughly analogous to _DYNAMIC), to help us find it. Secondly, going back to the restriction that only dynamic symbols were available for lookup, we could export all .symtab entries as “hidden” dynamic symbols. The obvious objection to this is that it will slow down linking. However, symbol lookup by name happens via the hash table (DT_HASH or GNU_HASH). I haven't checked all the details yet, but it appears that adding them to the symbol table needn't necessarily mean including them in the hash table. There has to be a hash-chain “next pointer” per symbol, but it's permitted not to hook that entry into any bucket. Since link-time lookups occur via the hash table, leaving the hidden symbols out of the hash table seems to defeat the objection. They'd be present for a “deep” introspector to find, by scanning the symbol table. But they wouldn't interfere with the symbol lookup process, nor with code using dlsym().


If you'd like to use this introspection goodness, I've coded it all up in a single header, relf.h, which is fairly self-contained and should be self-explanatory. Please let me know if you find it useful!

[/devel] permanent link contact

Tue, 30 Jun 2015

A position on licensing research software

I have a habit of putting code on my github page without any statement of license. Sometimes this is oversight or laziness, but sometimes it's entirely deliberate. It's not because I don't want to open-source or otherwise license my code; I do. But, as this post covers in some depth, the Devil is in the details.

In particular, I believe that software whose purpose is scientific or researchy—to exist, to prove a concept, to be extended by other researchers perhaps, but not to acquire users—can be better off without an explicit license, at least while it remains relatively immature and has only a small number of contributors. In short, this is because there's little practical loss from not licensing at this stage, whereas by forcing a choice of license, it's possible to choose wrongly.

For some reason, this belief, at least when compressed into 140-character snippets, attracted quite a lot of opprobrium. So allow me to explain my justification and respond to the criticisms. Before I go further, I should add that I'm definitely not pretending to be an expert on licenses, law, tech transfer, aggregate properties of big companies' IP policies, or anything else I'm writing about. Everything you're about to read is informed by limited personal experience and anecdote. In fact that's how I got here: I'm very aware that licensing is something I don't fully understand. For me, that's a strong motivation for avoiding making premature commitments that I might later regret. You have the opportunity of enlightening me, and might even change my position—but first please understand what my position is and why I hold it.

So, here goes: I claim that early-stage scientific proof-of-concept software is often better off without an explicit license. Naysayers (some hypothetical, but mostly real) claim the following.

“But then people can't legally use your software?” They can, but they should ask first.

“But people rarely bother to ask.” That's fine; this software is not aiming at getting lots of users.

“But you might miss out on users.” I don't want users.

“But you might miss out on collaborators.” I doubt it. Collaborators have to communicate, not just consume. They get in touch. Or, more likely, I meet them through other channels than their browsing my repositories online.

“But people aren't allowed to look at it without a license.” No, that's not how copyright works—quite the opposite. The code is published, so you're allowed to look.

“But company policies mean employees won't be allowed look at your code, even though it's allowed by law.” I've worked at two large IP-paranoid companies, and in both cases employees would happily exercise discretion about whether to look at code lacking a permissive license. (This was true even if official guidance left no room for their discretion.) Dealing with third-party code was also a well-established part of corporate culture. Among other things, this makes these developers much more likely to ask for permission than others. During my three months at one such company, in a very junior position, I was encouraged to make enquiries not only about licensing third-party software that appeared useful, but also about perhaps contracting the developer to support our effort. This surprised me at the time. But being a corporate entity dealing in non-permissively-licensed IP makes you more amenable to dealing with other people's IP in whatever way is appropriate. As I'll cover shortly, the well-known copyright paranoia (including what I'll call “GPL terror”) arises because of cases where the license on offer doesn't suit the prospecting company, and where there's good cause to fear litigation.

“But if they look at your code, they risk creating a derived work, so could get sued by your employer!” Companies know the pragmatics of IP better than this. If I were talking about Big Company Labs's github page, I'd agree that explicit licensing would be highly desirable. Big companies tend to be litigious. In all cases, the question is: who is liable to get sued by whom? But we're not talking about that. The risk to a company from simply looking at my code (emphasis important) is miniscule. The risk to me and to the public good, from premature commitment to a given license, is non-trivial, because it risks closing off opportunities later.

“Why not GPL your code?” I'm not against doing so in principle. In fact, even though I'm told it's falling out of popular favour, the GPL is one of my favourite licenses. But if I GPL'd my code, many more companies' employees would be scared off. Many big companies' legal departments have spawned corporate policies which foster a terror of GPL'd code. Partly this terror is justified: much GPL'd code is copyrighted by large competing companies, who don't want to relicense it and do want to sue infringers. However, the terror is not justified in all cases. I'm not a competing company, I am amenable to relicensing, and I'm not likely to sue infrigers anyway. So companies have no reason to be terrified of looking at my code. Unfortunately, many company procedures are written in a blanket style, with rules like “don't look at GPL'd code!”. that are not sensitive to these nuances.

“How is not putting any license possibly better? It confers strictly fewer rights!” I know it's perverse. But since these “don't look at GPL” policies exist, the absence of any GPL notice would actually make it more defensible, according to these policies, for employees to look at my code in any depth. The real problem here is that, as far as my anecdotes can see, many corporate legal people don't even understand the basic properties of the domain. Software doesn't “have” a license; it is licensed and can be licensed differently to different people. I sometimes say that licensing is “bilateral”, in an attempt to convey this multi-party “edge, not vertex” nature. Being GPL'd doesn't exclude other licensing possibilities. Sadly, corporate legal types tend to act as though it does. They're not the only ones. Some of my peers appear keen to (as I'll call it) “licensify” all scientific software. They seem to be acting on similar misunderstandings. I've even seen it written that one should not use the CRAPL because it allegedly makes it “impossible” to re-use the software. If you actually read and understand the CRAPL, it does no such thing (notwithstanding that as a license it is basically pointless). Even if the CRAPL were restrictive, the kind of software it's intended for—scientific software, as in my case—would almost always be available for relicensing anyway.

“Why not say “GPL, or other licenses by negotiation”?” It's not a terrible idea, but again, lack of nuance among corporate legal departments makes me believe that this will scare people off unnecessarily. I could be wrong about that.

“Why not MIT- or BSD-license your code?” I'm not against doing so in principle. But as a blanket policy, I worry that it'd weaken my ability to successfully do tech transfer later. In particular, the option of dual-licensing code as useful if you have already gone with a permissive license. This might weaken the position of any company set up to do the tech transfer. As I see it, this is the key difference between BSD and GPL from a researcher's point of view. In particular, note that pursuing tech transfer can be in the public interest (and bigger competitors are likely to be less public-spirited than a small start-up founded by scientists). I've nothing against permissive licenses in general; they just feel like too great a commitment to make in the early stages of a project. I happily release non-research software under these licenses (or even public domain) from time to time, as you can see. And I'll happily contribute under these licenses to other projects that have an established tech transfer model. The key points are that my early-stage projects don't yet have this, that there's more than one way to do it, and that licenses make a difference to what options are available.

“So you might later close off your research outputs for your own commercial gain?” Not at all. I'm dedicated to using my research outputs to create the greatest possible public social good. The question is how to do that. Whether BSD-style (permissive) or GPL-style (copyleft) licensing creates the most social good is still a matter of debate. It almost certainly varies case by case. Meanwhile, two other social goods need to be considered: being able to do more research, and actually getting research results transferred into practice. The commercial model can be useful in both of those regards, even though, ideologically speaking, it's not the model I prefer. It would be foolish to jeopardise these opportunities at the moment I publish my code, especially since nowadays the norm is to publish code early (no later than publishing the paper). Very often, at that stage it won't yet be clear how best to harness the code for social good.

“Aren't you inviting people to take the unreasonable risk of using your code without a license, which is breaking the law?” No. Anyone concerned with staying within the law is welcome to contact me about licensing terms. Anyone worried that I might sue them is welcome to do the same. This is no hardship: anyone interested in running or building on my experimental team-of-one research code almost certainly needs to contact me anyway—if not to build it, then to do anything useful with it. There's also an important separate point, which I'll return to shortly: scientific endeavour is built on personal contact and trust relationships, far more than either commercial or open-source software are built on those things. Even in the case of people who don't bother contacting me, the “reasonableness” of the alleged risk needs to be assessed in the context of the scientific culture.

“But in practice, if everyone followed your approach, won't people end up breaking the law more often?” Perhaps so. Most people break the law just going about their everyday business. Few of my countryfolk over the age of 30 have not transgressed our laws in countless ways: copying a tape, keeping a recording of a TV programme for longer than 28 days, wheeling a bike along a public footpath, borrowing their neighbour's ladder without asking, backing up a shop-bought DVD to their hard disk, and so on. Fortunately, the process of law enforcement involves discretion at the point of prosecution. This discretion is kept in check by social dynamics which, while mostly outside legal system, are very real. If my neighbour prosecuted me for borrowing his ladder, it'd render future reciprocal altruism considerably less likely. Suing good causes creates bad press, regardless of whether it can legally succeed. The law is overapproximate out of necessity, so that the genuinely problematic infringers can be pursued, not so that absolutely all contravening activity can be shut down. It presents the option to prosecute, not the imperative to do so. That's not an argument for egregious overapproximation in our laws. Obviously, legal overapproximateness can be vulnerable to abuse, and should be minimised by a process of continuous refinement. But aspiring towards eventually reaching a perfectly precise law is a hopeless ideal. The law is a practical tool, not a formal system—it is both ambiguous and inconsistent. The letter of the law needs to be viewed in the context of the legal process—one carried out with the frailty and imprecision that come from human weakness, and the malleability that comes from human discretion. In short, we are doomed to living with a background level of lawbreaking, and it's in our interest to live this way. With licensing, the right question to ask is: how can our use of licenses affect scientific research, and how should it? This means thinking about process.

“But don't big companies have good reason to be conservative? Won't this harm your relationship with them?” Yes, they do, but no, I don't believe so. Conservative policies are justified in a corporate context. For big companies in competitive markets, the stakes are high, and the capital cost of pursuing legal cases is affordable. If an equally large competitor had such a case, it could turn into a very costly court battle (regardless of who wins). That's why big companies are paranoid about IP: they're afraid of competitors. But they're not afraid of small guys, like researchers or universities. If I had a credible case that a big company were infringing my copyright, the company would very likely settle with me out of court. In any case, as I covered earlier, if they want to use my software, they will want to negotiate terms with me anyway.

“Isn't your attitude socially irresponsible?” I believe I have a stronger social conscience than most people I meet. Everything I've just said is a position I hold in the interest of maximising public good, whether or not my beliefs are correct or effective. So no, I don't believe I'm socially irresponsible at all. Here are some things that I believe are socially irresponsible: to allow corporate capture of publicly funded research outputs; to make licensing commitments of publicly funded work without very careful consideration; to needlessly jeopardise the future potential to exploit publicly-funded work to public-good ends; to allow the practices of big-company legal departments to get in the way of scientific endeavour.

“Licences are good, m'kay?” This seems to be some people's position, but it's not an argument.

“All sensible software projects have licenses!” Software that seeks users is one thing. I've already covered why I'm talking about something else.

“Why are you resisting doing the normal, “responsible” thing?” My situation isn't normal. In this abnormal situation, the normal thing is not necessarily the most responsible thing.

“GPL is not free, troll troll...” I'm trying not to get into that. But I do suspect that this whole debate is really GPL versus BSD in disguise. If you don't believe in copyleft, then the only open-source licenses you're likely to use are BSD or MIT, and there's not much useful flexibility to be gained by deferring the decision. So you can just license all your code straight away. BSD in effect chooses a fixed attitude to commercialisation (“anything is fine; what goes around comes around”) whereas GPL tends to create a fork in the road. I want to hang on to my option on that fork, until I know that doing away with it is the right thing to do. The fact that the permissive-versus-copyleft debate never dies is a powerful argument against coming down firmly on either side. It's better to take decisions in context, case by case. That's exactly the position I'm defending. Note that Bruce Montague's article advocating the BSD license model for tech transfer is right if you're talking about big projects—it's harder to commercialise GPL'd code than BSD'd code, and deliberately so—but completely wrong for most small research projects written by a small group of collaborators. In those cases, it's much easier to relicense the lot, in which case your options for transfer include a dual-license arrangement in the manner of Qt or MySQL. This still permits GPLing of the ongoing commercial development (or not), while retaining the option to license it for commercial development. It says a lot about how subtle the issues are that a simple tweak of context can more-or-less invert the up- and down-sides of the two licenses. (I actually find Montague's article quite fuzzy, even FUDdy, about the details—it only really says that BSD is “simpler”, which any old Einstein should be quick to challenge.)

“I once had licensing trouble, bad enough to require lawyers. Therefore, clearly explicit licensing is essential!” This is a complete non-sequitur. I don't doubt how problematic these things can get. But it doesn't mean that any blanket policy will improve things. If IP lawyers have come into a scene where science is supposed to be happening, it means something has already gone wrong. The culture clash is between science—where trust, good faith and cooperation are routinely assumed—and corporate licensing, where you can't trust anyone beyond what their contract or license binds them to. In short, this happens when big companies get involved and impose inappropriate licensing terms. These companies are where you should be directing your opprobrium—not at me.

To properly explain the latter point, I need to ditch the question-and-answer format (which was already becoming tedious, I'm sure). The argument came up in reference to the Richards benchmark. Let me summarise my understanding of it (corrections welcome). Martin Richards wrote and distributed some benchmarking code. It was distributed as a tarball, in the scientific spirit, with no license stated. It turned out (later) that he was happy to release it into the public domain. Trust, goodwill, reputation, responsibility, ethicality are currency among scientists. Martin's release occurred within that culture. My github repositories do likewise.

Some time later, Mario Wolczko produced a Java version of the Richards benchmark, while working at Sun Labs. Working for a big company makes a person more IP-fastidious than average. Mario very conscientiously sought permission from Martin (perhaps unnecessarily—I'm not convinced he was actually making a derived work, but perhaps he was; no matter). He was of course granted Martin's blessing. He circulated his new code, including to Martin. Martin even began distributing it in his tarball.

The catch was that Mario's employer was a large company, and would only allow the code to be released under some restrictive licensing terms. You can still read them. They come in two parts; the first page is statutory US export regulations while the second page contains Sun's own licensing terms. These include the following restriction.

“Licensee may not rent, lease, loan, sell, or distribute the Software, or any part of the Software.”

This is a severe restriction. It's especially inappropriate for a work of this kind—a benchmark. Benchmarking is a concern that cuts across research areas: it's not just people working on a particular technique or problem that want benchmarks. Anyone working on any aspect of a language implementation might want to apply some “standard” benchmarks. A benchmark is more useful the more things it has been applied to. That's why we often like to group them into suites and circulate them widely. This license is effectively incompatible with that use.

I'm sure Mario tried his best against the might of Sun legal. It's telling that his current web page advises the “caveat emptor” approach: make your own mind up about how likely you are to be sued, and act accordingly. I'd agree it's not good at all that we are being forced to take this approach around a big, litigious company (as Sun's new owner Oracle certainly is). But it seems bizarre to blame this on Martin's original sans-license release. The blame is entirely on Sun. The problem is not a lack of license statement from the little guy; it's an inappropriate license imposed by the big guy. The right response is to decline to use Sun's code. (That said, it's worth noting that Martin's redistribution of the Sun code, although technically a breach of the license terms, has not resulted in Sun or Oracle suing him. So perhaps they, too, understand the value of scientific culture and have no interest in disrupting it.)

To put all this more succinctly, the onus to license doesn't weigh equally on all parties. For artifacts coming out of big players, there's good reason to demand licenses—not just any license, but fit-for-purpose licenses—precisely because big players can afford to sue. By contrast, even the University of Cambridge is a comparatively little player. It is vanishingly unlikely to sue anyone pursuing public-good scientific or research agendas, not only because it's expensive to do so, but because the reputational damage would be huge. And even if you disagree that this particular university isn't a big player, that's immaterial because it doesn't own my research work's copyright anyway; I do.

I, like Martin, work within the scientific culture as a non-litigious “little guy”. It should go without saying that redistribution of my software towards scientific ends will always attract my blessing, not lawsuits—even if people don't ask first (mildly rude as that may be). Meanwhile, for commercial use, I'd like to defer fixing the exact terms until there's serious interest—so that I have some idea of how it's going to be used, and can choose terms appropriately. The pragmatics of “no license specified” and “restrictive license specified” are wildly different, even though a hard-headed legal interpretation makes the former a degenerate case of the latter.

The push towards “licensification” of scientific software actually works against this culture, since it erodes the norm that says that permission will be granted for doing reasonable things for scientific purposes. That's yet another reason why I resist this way of working. Licensification is also viral. If you believe strongly that all released artifacts should “have” “a license”, then you have to follow this to its transitive closure: to release anything that builds on anything, you must obtain a precise statement of license from all derived-from works. This is justified for commercially valuable software, but I don't believe it's the way science needs to work. And if you share my belief that premature commitment can do more harm than good, then we agree that science ought not to work that way.

You might complain that this “scientific culture” I've been talking about is nebulous. I'd agree. But capacity and inclination to sue are reasonable proxies. Call me naive, but if you really worry about the legality of doing science with artifacts released publicly by a reputable scientist working for a university, either you worry too much or the world has gone very wrong. On the other hand, it is reasonable to worry about artifacts released by big companies, because the culture they operate in is very different. That's why we need to pressure them into offering reasonable terms for the things they release, and decline to use their artifacts if they don't. That's what conspicuously failed to happen in the Sun/Richards case.

As more and more scientific research involves code, licensing of code among scientists is going to become much more of an issue. Misunderstandings are already rife: a quick bit of web searching will easily turn up things like this and this. We need to have a frank discussion about what the real issues are, and an open acknowledgement of their subtlety. I believe that blanket licensification is not the answer. It forces a decision to be taken too early, when too little information and expertise are available. I also fear that it will encourage a culture of litigiousness and bureaucracy that is neither desirable nor affordable within science.

As one tentative suggestion for how to make the law work better for us, I note that US copyright law's “fair use” allowance already covers educational purposes. Ideally it should also cover public-good non-commercial scientific purposes. In fact, this already is a factor in assessing fair use. But it's ill-defined, and I'm not aware of any useful case law on that front (are you?). If we want something to work towards, this kind of fair use law is what it should be—not the up-front licensification of every published scientific artifact.

Just as a coda: the above is what I currently believe, based on limited study and my very limited awareness of evidence, case law and the like. Although that belief has withstood a certain exposure to others telling me how wrong I am, it's not a conviction I'm strongly attached to. I went to the effort of writing the above not because I'm convinced I'm right, but because I don't like being branded “socially irresponsible”. Challenge my argument if you like, but don't doubt that I treat this issue with due care and attention. I'm still waiting for the counter-argument that will make me change my position. You might well have it. If you do, please let me know.

[/research] permanent link contact

Wed, 17 Jun 2015

“Type safety” and observability: an irony

As I've written before, people mean a lot of different things when they say “type-safety”.

The meaning I like best is that of Ungar et al: “the behavior of any program, correct or not, can be easily understood in terms of the source-level language semantics”. This is a dynamic property. It means we can explain failures in terms of the source language; they can't make the program do strange or arbitrary things. This is usually what people mean when they talk informally about “type-safety” in relatively lower-level languages like C, C++, or even Java, in dynamic languages (the context Ungar and friends were coming from), and also by Krishnamurthi and Felleisen. Language safety is an absence of corrupting failures; only understandable, explainable failures are allowed. (I find it irksome that “type-safety” has come to denote this general safety property rather than anything directly to do with type errors. Oh well....)

If type safety is something to do with explaining, what is an “explanation”? The most common kind of explanation is a stack trace. Stack traces are interesting because they combine a certain amount of live execution state and a certain amount of execution history. They tell us not only how the program was trying to proceed (its continuation), but also how it got where it is (its history). An explanatory stack trace will be heavy on the historical information, even if that information is not needed for continuation, because “how it got there” generally offers a clearer explanation than “what it would have done next”. For example, an ideal stack trace would include not only a list of active function calls on the stack, but also the values of their parameters—ideally the parameters' values on entry. Many of these values will be “dead” when the trace is generated: execution has moved on since the entry to the call, so those values are history (literally). This history is useless to the program in continuing its execution, but is invaluable to the programmer reading a stack trace, because they serve as explanation. This phenomenon—that “dead” state can nevertheless be useful as explanation—is one we'll return to.

If one kind of “type safety” is defined in terms of explanation of failure states, what's another sense of type safety? It's that “well-typed programs don't go wrong”. This is a completely different property, and I prefer to call it type-soundness. (Also note that it's using a different meaning of “type”.) In a type-sound language, by the time we run a program, we've proved that certain errors won't happen. The system for doing that proof is called the “type system”.

Now for the irony. Type-soundness is used to justify elimination of explanation in ML implementations. ML-family languages' type-soundness property means that the implementation can do away with run-time metadata (tags). This has the side-effect of removing the explanatory value that these tags would otherwise have. According to our initial definition, “type safety” was all about explanation.

But if we've proven type errors won't happen, there'll be no need to explain them, right? This us true, but the problem is that type errors are not the only errors that type tags can help explain. Even though no type errors will occur, other run-time failures will happen, and they are in need of explanation. Type tags are an obviously useful artifact for generating such explanations at run time. If we want to generate a stack trace, type tags make it easy to implement a more useful explanation—including the valuations of local variables, say. Without them, this becomes very hard.

Are other run-time errors really inevitable? The short answer is yes. Any general-purpose programming language, no matter how “safe”, has to be accept the fact that some run-time errors, for example unhandled exceptions or assertion failures, will be feasible. This is simply because computation is necessarily input-dependent. Even if you prove that your program never fails when run on in-spec input, out-of-spec input has to be detected and handled somewhere. No useful program runs in a completely closed system. (If you can safely assume you don't get any out-of-spec input, it's because it's been filtered out somewhere else in the wider system.)

Are type tags really necessary to present useful stack traces? No, not strictly. We can think of various schemes for restoring this ability, and my previous post goes into this in some depth. But these are much harder than just using tags, and it shows: implementers don't bother with them. On most implementations of ML-family languages, if a fatal exception occurs, the explanation offered will be, at best, a list of source-code locations (function names, possibly with file and line numbers) on the stack. It won't include valuations of locals, or anything about the concrete types of values. That's because the run-time tags which would allow printing out these things have been removed. In short, run-time failures are actively made less explainable by the erasure of type tags. (I'm told that AliceML is better than other MLs in this respect, though I haven't yet tried it myself.)

We should not be impressed by the “look, no tags!” property of ML unless some equivalent mechanism is in place to retain the degree of explainability which those tags afforded.

In theoretically-inclined PL literature, hard-to-formalise concepts such as debuggability or explainability—we could say “usability” generally—don't factor into how a design is evaluated. This is fine; it's simply Dijkstra's “separation of concerns” in action. Formalisation necessarily discards all details that are inessential to some specific (formalised) problem. A corollary is that problems that can't be formally stated are never addressed by theoretical research. Again, that's fine too. But producing a practical artifact calls for some kind of “systems thinking”: taking a much broader view than that of the theoretical results which may have inspired it. Any practical scenario is the superimposition of a collection of problems. Even for those that are amenable to formalisation, theoretically pleasing approaches to each one in isolation won't necessarily compose into a useful whole. The lesson of ML is that we have to be incredibly circumspect about transferring theoretical results, like erasability of tags, into programming practice. There must be a serious attempt to elicit and accommodate unformalised requirements. This needn't be deep or difficult: it's not rocket science to anticipate that we might want to observe the run-time state of an ML program, such as when explaining a run-time error. I find it sad that a succession of real ML implementations, all built by incredibly smart people, have adopted an unobservable, unexplainable approach for no good reason. Mistakes like this are holding us back.

[/research] permanent link contact

Tue, 26 May 2015

Polymorphism and observability

[In an earlier post, I talked about debugging, and more generally “observability”, in ML-family languages. Later, I also clarified what I think by “polymorphism” most usefully means. This post explores the less obvious relationships between polymorphism and debugging.]

When I talk to language-minded people about debugging code in ML-family languages, they tend to think that the difficulties are something to do with “the type system” and “polymorphism”. This is only half right. Polymorphism does complicate debugging, but it does so in every language, even ones with no “type system” of any kind. (As a sanity check, ask: does BCPL support polymorphism? The answer is clearly yes, at least according to my earlier definition.)

The axe I'm grinding in this series of posts is that “polymorphism” or “type-mumble” are no excuse for a lack of decent observability in ML-like languages. There are no major obstacles to implementing a nicely debuggable ML, and certainly none regarding polymorphism. Intuitively, this makes sense if we remind ourselves that polymorphism is to do with abstraction, whereas observation is done on the concrete: we're observing a concrete program state, laid out in front of us. (Of course, there are some unfortunate decisions taken by existing implementations of ML-like languages, that make retrofitting debuggability more difficult than it might be. That's a very different problem!)

A similar viewpoint explains why other kinds of fancy types present no obstacle. Even using fancy features of OCaml or Haskell, like GADTs or type classes, our program state still boils down to a big pile of atoms, sums, products and functions. The innovations in “type systems” have been in reasoning, specifically about safety. This is a question of dynamics: checking at compile time that invalid structures will never be constructed at run time. Observability isn't concerned with dynamics; it's about looking at the present, not the future. All we want to do is decode a static snapshot of the program. (Here I'm excluding the debugger feature of “altered execution”, i.e. enacting debug-time side-effects on the debugged process. How to do this safely is an interesting question, but I'm not going to dwell on it here.)

Can we expose more clearly why polymorphism isn't a problem? As I covered last time, “polymorphism” is a fancy word for deferring specialisation. Specialisation can be done in a compiler or at run time. At run time, specialisation means execution: execution is a specialisation process that culminates in the program's result. We can also think of this process as “decision-taking” in response to input.

Polymorphism during execution, witnessed in program snapshots, is very different from polymorphism in source programs. In source programs, the whole program is oriented around an unknown future: programs describe dependency on an “input”, supplied later. By contrast, observing a program state at run time is about decoding a present, not a future. Moreover, to help us do that decoding, we can exploit all the decisions that have been taken so far, i.e. all the specialisation that has occurred, during both compilation and execution, to reach the present state. Some of this specialisation can be called “monomorphisation”, because it has taken generic code and applied it in a specific context.

As before, I'll focus on OCaml. The OCaml compiler turns polymorphic source-level functions into generic run-time function objects (instruction sequences). Similarly, for polymorphic data types in source code, the compiler selects a size and layout, independent of any type variables that might be parameterising the definition. As we would expect, this is achieved using indirection: the fixed size and layout ensure that locally, storage can always be allocated and accessed generically. The specialised part of the data—for example, the payload of a list node—is indirected away, using pointers. If we were talking about C, these pointers would be pointers to void. OCaml's source language of data types lets us be more precise about these, by introducing type variables. But that's a meta-level bonus that helps us reason about dynamics. A snapshot of an OCaml program still reveals it as consisting of allocated objects and pointers between them, just like in C.

Viewed as source code, a program consists largely of functions and data types. But during execution, we have other things too: activations of functions and instances of data types. It's usually these that we want to inspect when debugging. For example, a backtrace is a list of function activations. The heap is a collection of values—instances of some type. (It also contains closures, which are a bit of both; I'll ignore them, for simplicity, but you can ask me at the end.)

Here is the key observation about polymorphism at run time. Whenever a polymorphic function is activated, or when a polymorphic data type is instantiated, some instantiation of its type parameters is morally decided. “Morally” means that we could define an oracle, using the semantics of the language, that tells us how they are being instantiated. For example, it could tell us that at some given allocation site, we're creating a list of int rather than a list of 'a (whereas the latter is all the source can tell us). Exactly what this means, however, is slightly subtle.

One subtlety is that the code doing the allocation doesn't necessarily know what this instantiation is. That code might itself be generic! So maybe we're building an int list list out of some int lists. The code doing this might only know it's building an 'a list list, but our oracle would still tell us that the allocation “morally” has the more precise type int list. Another subtlety is that, of course, there's no guarantee at the implementation level that our runtime is helpfully defining any such oracle for us. Nor need the compiler have provided us any output that would help us implement the oracle. In the case of OCaml, these are definitely not true, and that's precisely why it's difficult to add debugging support to the current OCaml toolchain.

Another subtlety is that the “instantiation” does not necessarily yield something free from type variables. Although our int list list example got rid of all the variables, in some other cases we might find the best we can do is to instantiate 'a with 'b -> 'c, say. But this turns out not to stop us from observing anything we might logically be able to observe. I'll return to this shortly.

One way to make OCaml debuggable might be to directly implement this oracle, by maintaining extra state at run time. Whenever we call a polymorphic function or instantiate a polymorphic data type, we could stash information somewhere that explicitly records how the type parameters are being instantiated. Something quite similar was done a while back in the HashCaml project. Unfortunately, it's a fairly invasive change to the compiler. It's likely to meet resistance via a performance argument, which you can think of this as the “frame pointer” debate but for type information. Pushing around extra information creates a bit more pressure on registers and memory, so typically shaves a few percent off performance. In return, we make observability massively more straightforward. Apparently opinions differ on whether this is a good trade. All I'll say is that if frame pointers are good enough for the Linux kernel, they're good enough for me.

Instead of tracking allocation types up-front, one could do a deeper analysis to recover the same information on demand. If we assume we have a backtrace, that gives us a powerful chunk of context: it describes a nest of function activations. The top-level activation (which would be main(), if OCaml had main()) is always monomorphic, so we should be able to figure out all the subsequent instantiations all the way down the stack. Or, we can flip that around: starting from a given activation, we should be able to figure out any type variable instantiations by looking some distance up the stack, and in the worst case, all the way to the top. Currently this is what my OCaml-implementing colleagues prefer; they expect it can work by looking no more than a few frames up the stack in the common case. The logic involved is basically the same as that of the compile-time type checker—which now needs to be replicated in the debugger and/or the language runtime. That's an annoying chunk of replicated stuff, which I find distasteful. Also, this inference might be expensive—fine for an interactive debugger, but poor for other applications of run-time type information (like serialization routines or a tracing tool, say). The advantage is that it requires fewer changes to the compiler.

A third option would be to relax our aim of recovering source-level types. In practice, we don't necessarily care, at debug time, that we're looking at an int list. It might be enough to look at each list node individually, seeing that it's a Cons, and then, separately, discover that each Cons points to an int. In this way we've avoided “typing” at the same granularity that the OCaml language does typing, but we've still recovered a somehow “typed” view of the program (i.e. one interpreted in terms of source-level data types). Put differently, source-level types like list encode information about a whole structure, spanning multiple allocations. Perhaps all we really need is a piecewise, per-allocation view. Currently, OCaml's tagged-pointer implementation ensures that at word granularity, we can distinguish integers from pointers. That's not enough, because we can't, say, distinguish the first variant of type T from the first variant of type U, nor from the integer 0: all are encoded as a zero word. But if we add local tracking of ADT variants and a few other things, that might be enough for observability purposes, and would be less invasive than a full HashCaml-style solution. I find this promising, although I'm still working through the consequences.

Suppose we stick with our oracle-based approach, tracking a source-level type for each allocated value. There seems to be a complication. I mentioned that type parameters are decided at instantiation points. but also that we might only be deciding that 'a becomes 'b -> 'c, say—we're not fully monomorphising them. This makes sense, and just reflects the nature of functions. Suppose we have a list of functions. A perfectly valid such list might contain the list head function hd. That's a generic function of type 'a list ->'a. When we instantiate our 'a list to one that can hold this function, we've specialised type parameter 'a to 'b list -> 'b. Our list is still polymorphic: we haven't got down to a monomorphic type. Does that mean we're lacking the ability to observe something in our program state? The answer is a resounding “no”! I mentioned that when debugging, we're looking at the present and not the future. The polymorphism in hd encodes the unknown future: we don't yet know what types of arguments the functions in the list will be applied to (it hasn't happened yet!). So, these polymorphic-at-run-time values in our heap represent the residual genericity in delayed computations, i.e. in functions. Functions encode things our program hasn't done yet, but might. They don't present an obstacle to decoding the current program state. In practice, any function has a name, even if it's a fake one generated by the compiler from the source code coordinates of a lambda. If we're in a debugger, getting the name of that function (or those coordinates) is comfortably good enough.

There's a final apparent complication. What about the empty list, or the null pointer? These seem to be polymorphic values. But unlike functions or data types, they're not going to get specialised further by activation or instantiation. The simplistic answer is that these values are okay because they're degenerate cases. It's not a practical loss of observability at run time if we can't answer the woodchuck-esque question of “what type of non-empty list would this empty list be if it wasn't empty?”. A more subtle answer is that these values aren't really polymorphic at all. If we think of how we would define the data type 'a list, we see that the Nil constructor, viewed in isolation, isn't polymorphic—it doesn't use 'a. In the context of this constructor, 'a is a “don't care” or ignored argument. An unparameterised constructor is only vacuously polymorphic: its meaning doesn't actually depend on the parameter. (This view, which sees constructors as somewhat independent of the type definition that encloses them, is one which OCaml's polymorphic variants directly build on.)

Finally, I alluded to some parallels with C. Just as the pointers which allow generic layouts for ML data types are equivalent to void pointers, so we have a similar problem when debugging C code: what's on the end of a void*? If I'm looking at a generic linked list node in my debugger, say, the debugger won't let me follow the pointer to the payload data. For that, we would need some run-time service that can look up some metadata about arbitrary memory locations and tell us what's stored there. Java-style VMs solve this problem using object headers. Clearly we don't have this in C; we need some extra infrastructure to answer these questions. I've been working on it: it's called liballocs. By dynamically tracking allocations in our running program, and using some carefully crafted associative data structures, we can build up a fast mapping from arbitrary pointers to metadata about the pointed-to allocation.

In fact the reason I got interested in this topic was that I wanted to make liballocs understand allocations made by OCaml programs. One of the complications liballocs has to deal with is polymorphic allocation sites. These sometimes occur in C code. For example, we might malloc() an array of generic void*, say, but actually use it to hold some specific kind of pointer. Genericity like this is occasional in C, and commonplace in ML. But there's no fundamental difference: code can be generic regardless of whether our source language includes a type language for describing that genericity. Genericity itself is what makes debugging tricky, because it indirects away concrete details in a way that some implementations (both C and OCaml) make hard to recover at run time. The presence of a fancy type system isn't the problem.

[/research] permanent link contact

Fri, 27 Feb 2015

Talking about polymorphism

[In my last post, I talked about debugging, and more generally “observability”, with reference to ML-family languages. This post is continuing the theme.]

Before I go into what's different about polymorphism between compile time and run time, I need to get some terminology straight. What is polymorphism anyway?

People traditionally talk about “ad-hoc polymorphism” and “parametric polymorphism”. These terms go back a long way— to Strachey in the the mid-1960s. They're unhelpful today, for two reasons. Firstly, and most trivially, some flavours of apparently non-parametric polymorphism are nevertheless rather systematic, so ideally shouldn't be called “ad-hoc”. Type classes in functional languages and method overriding in object-oriented languages both illustrate this.

Secondly, calling polymorphism “parametric” confuses people into thinking that if code doesn't have type parameters, it's not parametrically polymorphic. But, to me at least, this kind of polymorphism is not about how code is presented; it's a deeper property, about how code is expressed. It's easy to find code expressed in apparently polymorphic ways in all languages—even ones where there is no additional metalanguage of types to describe that polymorphism. Some C code is chock-full of such polymorphism—take Linux, for example. In such code there are no type parameters, only void. Similarly, one of the main strengths of dynamic languages is that code often comes out with some degree of polymorphism by default. If you write a sort function, say, you tend to get something that can operate on any kind of ordered sequence.

So although “polymorphism” sounds fancy, it's really just talking about delaying specialisation. If we're write code in a way that avoids commitment to details that would specialise it to one kind of data or another, we're writing polymorphic code. Fancy type systems are a means of reasoning about polymorphism, hence enabling static safety guarantees about polymorphic code. But they're not what enables polymorphism itself.

(I am probably departing from Strachey at this point, although it's not clear. His presentation doesn't directly address the question of whether polymorphism is really polymorphism without a metalanguage to describe it. For the purposes of my subsequent writings, I declare that it is, and that the metalanguage is a separate concern. Even if you disagree, it won't affect the substance of anything that I'm about to say; only which choice of words is an agreeable way to say it.)

Expressing code in a way that defers specialisation invariably means introducing some kind of indirection. Here I mean “indirection” in a very general sense. Run-time indirection via pointers is certainly one very familiar way of introducing polymorphism: a list node having a void* payload has avoided specialisation, whereas one with int hasn't. But there can also be compile-time indirection, such as between an identifier and its binding. And then there is also run-time indirection across not only storage, but also computation. (The latter is the trick which, when applied to its fullest extent, gives us objects in the sense of Cook.)

Somewhere along the line I have stopped using the qualifier “parametric”. In fact, the view I've just expounded—that polymorphism is the deferral of specialisation—covers ad-hoc polymorphism too, if we look at it the right way. When we say a source language supports “ad-hoc polymorphism”, it means that it lets us group together collections of specialised definitions. Such a collection of definitions, viewed as a unit, start to look like a single, unspecialised definition—in other words, a polymorphic definition. Again, indirection is what enables this polymorphism. If I write the token “+” in a language with overloading or type classes—Haskell or C++, it doesn't matter—it denotes the whole collection of defined addition operators. The code thereby avoids directly selecting any particular one. Typically, the selection is done indirectly. Usually it happens in the compiler, since the language defines some rules for selecting which one will actually be used. But it might happen very late in compilation since these rules might allow dependencies on the code's context of use (think how “+ is resolved in a C++ template definition, for example). And all this could equally well happen at run time; multimethods allow us to do this for binary operators like plus, while the usual single dispatch is implementing precisely the restricted unary case: deferring specialisation, on (only) the leftmost argument, until the execution of each individual call.

If polymorphism is this incredibly general thing, we need some way of getting down to specifics—like the question I started with, about what “polymorphism” is present during execution of compiled OCaml programs on a particular OCaml runtime. One thing that helps is to qualify any mention of polymorphism with a particular time: perhaps compile time, perhaps run time. At any given time, we can describe a polymorphic entity (perhaps a definition in source code, perhaps an object at run time) as somewhere on a spectrum between strong “specialisation” and strong absence of specialisation, which we can happily call “genericity”. (I say “strong” not “full” because the logical extremities of this spectrum tend to be degenerate, like “the output of the program for given input”—the most specialised state!—and “not written any code yet”—the most generic possible state.)

This “one time” restriction forces us to distinguish a source-level view from a run-time view. It's clear that this distinction matters. Polymorphism in source code can sometimes be implemented by specialisation before execution. C++ templates are the classic example: I can define a generic template, but the compiler will implement it by generating lots of specialised instances. Meanwhile, certain whole-program optimisations in ML compilers, like MLton, do effectively the same. This kind of compile-time specialisation is sometimes called “whole-program monomorphisation”. Just as “polymorphism” is a fancy word for the deferral of specialisation, “monomorphisation” is a fancy word for applying some kind of specialisation.

Polymorphism only defers specialisation; it doesn't avoid it. As I'll talk about more in the next post, all polymorphism eventually ends up being specialised away somehow. What varies is where and when this specialisation occurs. It might be done early, in a compiler, or late, during execution, by “elaborating” a particular trace of machine instructions in a particular machine context—i.e. by actually running the generic code! In the latter case, each execution is “specialised” in the sense that it runs in a different machine context and so produces a different outcome. Either way, execution is a gradual process of specialisation, entailing the gradual removal of polymorphism.

Phew. Perhaps that was a lot of belabouring the point. But it certainly helped me clarify (to myself) in what ways “polymorphism” might be present in ML code at run time. I'll revisit this precise question in the next post.

[/research] permanent link contact

Fri, 20 Feb 2015

Putting observability first

[Update: this article has now been translated into Russian! Thanks, Vlad.]

Last week I had a fascinating conversation with Mark Shinwell and Stephen Dolan, two colleagues who know the OCaml runtime inside-out. We were talking about ways to make compiled OCaml code observable: making it support interactive debugging and various other dynamic analyses (notably memory profiling).

It turns out that although this is not impossible, it is very difficult. It's fair to say that the OCaml runtime has not been designed with observability particularly high on the wishlist. It's far from unique in this: ML-family languages are usually not implemented with observability in mind. In fact you could argue that ML is explicitly designed for non-observability. The orthodoxy of the day held that tagged storage was the naive implementation technique, and erasing tags at run time was a good thing—mainly since it enabled a modest performance gain. Moreover, this meant that the provable erasability of tags came to be considered a valuable mathematical result, about language designs in general and ML in particular. The lingering side-effect is that this erasure also removes the ability to decode program data straightforwardly from its in-memory encoding. In the absence of some other mechanism—which mainline OCaml doesn't currently have—this greatly compromises observability.

ML is not alone in this. Although I'm not so familiar, I'll wager that Haskell, in GHC, suffers similar observability limitations—or perhaps worse, since it does more extensive compile-time transformations, and these cannot easily be “seen through” at run time. This makes it even harder to explain the state of an executing program, seen at run time, in terms of the source code. Also, I'm not singling out functional languages. Java has serious flaws in the observability mechanisms exposed by its virtual machine, as I once wrote a short paper about.

C, as usual is an interesting case. On the surface, it appears to be designed for little or no observability. It certainly doesn't generate much run-time metadata. This omission could be seen as an ML-like erasure of implicit tags. Yet this is not an accurate representation. Pragmatic working programmers, like Ritchie and Thompson, know well the value of interactive debugging. The first edition of Unix included db, an assembly-level interactive debugger mostly used on coredumps. By the eighth edition two debuggers, adb and sdb, were documented in section 1 of the manual, the latter supporting source-level debugging, while a third debugger (pi; built on Killian's procfs) was bundled in the extras which would become Plan 9. More modern debuggers have kept pace (more-or-less!) with more advanced compiler optimisations, using complex metadata formats like DWARF to describe how to see through them. (This isn't bulletproof, and has its own problems, but works remarkably well for C.) The result is that C code has been consistently highly observable—albeit not without forty years' continuous toil to co-evolve the necessary language- and system-level infrastructure.

This co-evolution is interesting too. The mechanisms for achieving observability in C code lie outside the language. Coredumps and symbol tables are system mechanisms, not language mechanisms. Observability in Unix has been part of the design, down in the fabric of the system; it is not an afterthought. An opinionated system designer (okay, I'll step forward) might claim that observability is too important to leave to language designers. There are, however, some gaps to plug and some mindsets to bridge in order to apply Unix-style debugging infrastructure to ML-like languages. In another post in the near future, I'll dig a bit deeper into OCaml, by considering polymorphism (and why it's not quite the problem it seems to be).

[/research] permanent link contact

Mon, 02 Feb 2015

Thoughts on peer review

My PLDI submission was rejected. I'm not too sad about this, since the reviews were basically encouraging, and on balance I agree that the paper can still be improved and was arguably not (quite) ready yet.

However, despite certain innovations, namely two-phase review and rebuttal, the conference review process is as creaky as ever. This is rather dispiriting, and the longer I spend doing computer science research, the more bizarre our system seems. I filled in the anonymous post-conference survey, but since my comments apply more widely than to a single instance of a single conference, I feel like forgoing my anonymity and re-posting them here (lightly edited). I hope they don't have too much of a sour-grapesy flavour—at least, that's not really how I'm feeling.

Something I didn't elaborate in my survey response: what would this “progressive model” be? It might involve reserving a smaller part of the programme for short presentations of recent journal publications, and a larger part of the programme for short presentations of high-quality in-progress work. This would be work at a substantially more advanced stage than a typical workshop paper—possibly reasonably mature already, but not yet accepted for publication in a journal. Initially, it seems important to ring-fence the slots for the latter kind of presentation, to avoid having already-published work always trump the not-yet-published stuff. Eventually, in a well-functioning community there would not be much call for the former kind of slot, since most journal publications would describe things that had been presented at previous conferences.

[/research] permanent link contact

Tue, 16 Dec 2014

For and against languages

A colleague recently summarised my research's position, not unkindly and partly in jest, as “against languages”. One one level there's some truth in that. I decry language-specific toolchains, runtimes, debuggers and package managers; I scoff at claimed benefits of being “100%” “pure” <your language here>, I am endlessly sceptical of anyone trumpeting one language too loudly. In my own work, I am preoccupied with language-agnostic runtimes, linkers and debuggers, cross-language composition and language-agnostic program analyses.

But that doesn't mean I dislike languages—far from it! Language innovations deliver their highest net value when we're free to pick the right one for the job, and when it's possible to adopt new languages without suffering massive pain. This also means it must be easy to migrate away from a language. It's only by humbly accepting any one language's inevitable imperfection that we can use each language to its best advantage, and can accommodate progress in languages generally.

Not all code is alike, and the world doesn't change overnight. We should therefore expect code in multiple languages to coexist. The alternative is getting stuck with the “current” and never making it into the “new”. We need to think of coexistence of languages not as a one-time transitional measure on our way to the perfect language, but as inevitable, ongoing reality. Even if we consider a particular language to be “legacy”, “outdated” or “broken”, we shouldn't make dealing with that language a second-class use case. If we are to have progress, we must give these scenarios first-class consideration. (This is why I lambast the status quo of “foreign function interfaces”, and in fact the very idea of them.)

So it's not that I'm against languages. I'm interested in making language innovations as useful as possible. I'd say that makes me more strongly in favour of languages than many people.

[/research] permanent link contact

Mon, 24 Nov 2014

I hate systems research... sort of

[I wrote this article on 6th August 2008, when I was a second-year PhD student, but then didn't have the guts to post it. Looking back, it's not so controversial, although it was obviously shaped by my (relatively narrow) experiences as a PhD student in a particular corner of the Networks and Operating Systems group here in Cambridge (a group I am very happy to be associated with, I should add).]

David Byrne, a well-known proponent of musics from around the world, once wrote an article entitled “I hate world music”. He hadn't experienced a sudden change in his taste, but was referring to the label of “world music”. By labelling all music that doesn't conform to a certain narrow set of Western-heritage styles as “world music”, it can be conveniently marginalised and dismissed as a unit.

So too with computer science research, labels are used to divide and marginalise. This time, it's the inside marginalising the outside. A recent EuroSys call for papers invited papers on “systems aspects of...” a long list of topics. Despite having been working in a systems research group for nearly three years, nobody has ever told me what “systems aspects” are. Yet most people who'd describe themselves as “systems researchers” seem to wear that badge with some degree of pride. Until someone can tell me what it means and why it's such a good thing, I won't be doing the same.

What defines systems researchers? Is it that they “build stuff”? This is perhaps necessary but hardly sufficient. Is it that they meddle with operating systems, networking stacks or other infrastructure software? Perhaps, but it seems to cover a lot more ground than that to which the label is commonly applied. My preferred take is that “systems” is simply a label, of historical origins, which now identifies a research community—a community which, like all social networks, is defined only by the links among its members and the continuity of that relation's evolution over time. One thing that has continued to surprise and frustrate me about the world of research is that despite huge overlaps of interest, it is divided much more than it is united: divided into largely separate, non-cooperating communities. Many systems researchers would be horrified (thanks to a snobbery I will come back to) if you claimed that their work overlapped with that of the software engineering community. Yet having attended in the last year both EuroSys and ESEC/FSE, I estimate that at least 40% of the papers I saw at each of those conferences could easily have been presented at the other without any sense of incongruity whatsoever.

Last week at the Semantics Lunch, Mike Hicks gave an excellent talk about CMod, the module system for C developed at the University of Maryland. Unlike other Semantics Lunches, the advertisement was cc'd to my research group's list. I get the e-mails anyway because I'm subscribed to other lists, but the obvious inference is the impression that systems researchers will be interested in any languages research that concerns C, but not in any other. This impression is certainly not fair, but the exaggerated impression does indeed stem from a grain of truth.

One way of getting your tools- or languages-oriented paper published in the systems community is to show how it applies to “systems code”. This almost always means one of three things: performance-critical code, resource-constrained code, or code written in C. These are all pretty general requirements, and fully relevant to the widest imaginable “software engineering” community. Regarding the last of the three requirements, of course there are many great reasons for creating tools that work with C, given the astronomically huge volume of existing software that is written in it. Working with C is certainly a priority for my own work. However, everyone knows that C is a language with many gigantic problems, and ones that have been known for decades. Cynically, I would claim that part of the reason that C-language developments, both code and tools, continue to flourish within the “systems community” is an endemic disdain for programming language advances, or in general of any research community that isn't “systems”.

This separatism comes in more than one form. There is a certain air of what I call “macho bullshit”, where, among other things, there is pride is associated with writing unnecessarily nasty code in unnecessarily low-level languages. Now, I have seen some exceptionally clean and well-engineered C programs, particularly ones coming out of my own research group, so this isn't any sort of blanket statement. I've also seen, again locally, many systems researchers who are heavily into language advances. But the other side of the coin is there too: the disdain and the separatism are definitely things I've observed around here, and I don't like them. They're joined by an attitude to criticism: there appears to be an unwelcome degree of sniping and dismissiveness, found in reviews and programme committees, together with a bizarre culture of unnecessary controversy, as typically found in workshops with “Hot” in their title. It seems there are too many systems researchers prepared to forward any criticism of others' work, however invalid, and pronounce any self-aggrandising position, however obtuse and misleadingly argued. This is, in a word, unprofessional, and it undermines the very goals of research. I don't have enough experience to know either how widespread these attitudes are, or how much more systems research suffers from them relative to other fields.

[/research] permanent link contact

Wed, 19 Nov 2014

How to write vaguely acceptable makefiles

It's not hard to write a clear, orderly makefile if you respect it as a programming job in its own right, and adopt some principles. I rarely see such principles written down. Here's what I've gravitated towards as I've learnt to use make more effectively.

[/devel] permanent link contact

Tue, 07 Oct 2014

Seven deadly sins of talking about “types”

[Update: this article has been translated into Japanese!]

My essay “In Search of Types” attempts to be a dispassionate review of some of the different concepts, purposes and attitudes surrounding the word “type” in programming. I imagine that my passions are still rather thinly veiled in places. But in this post I'm going to come out with a more unabashedly ranty take. Certain statements and attitudes really get up my nose. I was recently exposed to a few of these at Strange Loop (a great conference, I should add). So I've written down a list of “deadly sins” committed by many people (mis-)speaking about “types”

I should add that the theme here is rhetoric. What gets up my nose is when people don't make honest, transparent arguments. Their conclusions needn't be wrong. I program in OCaml a reasonable amount, and it's plain that I get a lot of value from the type checker. But advocates of type systems often try to sell them as a miracle cure, without acknowledging their limitations and undesirable side effects. Can we please move the debate past propaganda and blanket statements?

A contributing factor is that we often struggle to disentangle the many distinct concepts that lurk under the word “type”. My essay tackles this entanglement more directly, although hopefully the following rants will make some useful distinctions.

Not distinguishing abstraction from checking

This is my number-one beef. Almost all languages offer data abstraction. Some languages proceed to build a static checking system on top of this. But please, please, don't confuse the two.

We see this equivocation used time and time again to make an entirely specious justification of one language or other. We see it used to talk up the benefits of type systems by hijacking the benefits of data abstraction, as if they were the same thing.

The discussion of “stringly-typed programming” (sic) at Strange Loop, both in the Snively/Laucher talk and in Chris Morgan's Rust-flavoured talk, was doing exactly this. Yes, it's true that HTTP headers (to borrow Morgan's example) can and should (usually) be abstracted beyond plain strings. But failing to do so is not an omission of compile-time type checking. It's an omission of data abstraction. Time and time again, we see people advancing the bogus argument that if you make the latter omission, you must have made the former. This is simply wrong. Type checkers are one way to gain assurance about software in advance of running it. Wonderful as they can be, they're not the only one. Please don't confuse the means with the ends.

Pretending that syntactic overhead is the issue

The Sniveley/Laucher talk included a comment that people like dynamic languages because their syntax is terse. This might be true. But it made me uncomfortable, because it seems to be insinuating a different statement: that people dislike type systems only because of the syntactic overhead of annotations. This is false! Type systems don't just make you write annotations; they force you to structure your code around type-checkability. This is inevitable, since type checking is by definition a specific kind of syntactic reasoning.

To suggest that any distaste for types comes from annotation overhead is a way of dismissing it as a superficial issue. But it's not superficial. It's about the deep consequences type checking has on how code must be structured. It's also (perhaps ironically) about polymorphism. In a typed language, only polymorphism which can be proved correct is admissible. In an untyped language, arbitrarily complex polymorphism is expressible. Rich Hickey's transducers talk gave a nice example of how patterns of polymorphism which humans find comprehensible can easily become extremely hard to capture precisely in a logic. Usually they are captured only overapproximately, which, in turn, yields an inexpressive proof system. The end result is that some manifestly correct code does not type-check.

Patronising those outside the faith

If somebody stays away from typed languages, it doesn't mean they're stupid, or lack education, or are afraid of maths. There are entirely valid practical reasons for staying away. Please drop the condescending tone.

Presenting type-level programming as a good thing

No sane person can argue that we want to bifurcate a programming language into distinct base-level and type-level fragments. Gilad Bracha calls this the “shadow worlds” problem: to abstract over constructs at level 0, we need a whole new set of level-1 constructs, and so on. This is an antipattern. It's a failure of compositionality. That doesn't mean it can't be justified for pragmatic reasons. ML's module system is the way it is because nobody has figured out the maths to do it any better under the various design constraints of the system (which of course include a proof of soundness). But please, stop pretending that type-level programming is a desirable thing. It's a kludge, and hopefully a temporary one. Personally, I want to write all my specifications in more-or-less the same language I use to write ordinary code. It's the tool's job to prove correctness or, if it must, tell me that it failed. (It should also do so using a straightforward explanation, ideally featuring a counterexample or stripped-down unprovable proposition. I don't know any type checker that does this, currently.)

Fetishising Curry-Howard

Curry-Howard is an interesting mathematical equivalence, but it does not advance any argument about how to write software effectively.

Equivocating around “type-safety”

Here is a phrase which is used to mean many different things. Old-fashioned “type safety” is what I call “Saraswat-style”, after the famous “Java is not type-safe” memo. (Of course, it wasn't invented by Saraswat—it was the “standard” meaning of the phrase at the time.) It's a valuable property. But it's actually about memory, not data types: it's definable in terms of a language featuring only a “word” data type (using only minor tweaks of wording from Saraswat's). It's also nothing to do with static checking—“type safety” holds in all “dynamically typed” languages. The reason it's such a useful property is that it can be implemented without sacrificing a lot, unless you want to program close to the machine. Although many implementations happen to use syntactic compile-time reasoning, a.k.a. type checking, to lessen the run-time checking overheads, this is an implementation detail.

Saraswat-style type safety is usually a very good thing. But it's a far cry from proving near-arbitrary correctness properties of code. It's popular to confuse the issue, by saying that if you don't use type systems you don't have “type safety”. This wilful mixing of separate ideas is cashing in on the relatively uncontroversial good value of Saraswat-style safety, and using it to paint “type systems” as a similar no-brainer. Proof has a cost, and the appropriate expenditure depends on the task. It's far from a no-brainer.

Omitting the inconvenient truths

If everyone would just use a modern language with a fancy type system, all our correctness problems would be over, right? If you believe this, you've drunk far too much Kool-Aid. Yet, apparently our profession is full of willing cult members. Let me say it simply. Type systems cannot be a one-stop solution for specification and verification. They are limited by definition. They reason only syntactically, and they specify only at the granularity of expressions. They're still useful, but let's be realistic.

Try checking realistic reachability or liveness properties with a type checker, and you will not succeed. In a formal sense, this can be done, and has been, but the resulting system has little in common with type checking as practitioners would recognise it. The authors note that the ostensibly type-style specifications it yields are “bulky” and often “not suitable for human reasoning”. This is hardly surprising, since they are working against the grain of the program's existing syntactic decomposition. To specify and check liveness or reachability, we need to represent programs as some kind of flow graph, where flows are very likely to span multiple functions and multiple modules. It's not an expressionwise view of code, and it's also not the syntax in which most people want to code.

We need to grow up and accept this. Once we've accepted it, what do we do differently? We need a model that lets us assimilate gradual improvements in automatic reasoning, like better SMT solvers, without requiring us to re-type (pun intentional) our code. We need to be able to step up the strength of our specifications without switching language all the time. We need to integrate typed and untyped code, and between distinct languages too.

These ideas are slowly gaining some traction, such as in gradual typing, but have a long way to go. In particular, accommodating multiple languages has some far-reaching consequences for our infrastructure. So far, we implement languages in a particular way: relying on a single in-compiler type checker to establish whole-program invariants. If we embrace multiple languages, we can't build things this way any more. Instead, it's necessary to tackle invariant enforcement not within a per-language compile-time service but within a more generic composition-time service. (Hint: it's called a linker. In short, the linker-like service must do a mix of proof and instrumentation, in a language-neutral way, guided by descriptive information output by the compiler. If you've seen me talk recently, this will probably sound familiar. Ask me for more!)

Coding without proving everything we'd like to prove is an activity which will continue. Meanwhile, late binding, and hence limited applicability of truly ahead-of-time reasoning, is often a requirement, not an implementation decision. Eliminating these can bring benefits, but also can eliminate value. Weighing up costs and benefits, in a way that optimises overall value for a given product, is the right way to think about all these issues. It's a long way from the dogma, rhetoric and blind faith that currently dominate our discourse.


Now that I've ranted my heart out, here's a coda that attempts to be constructive. As I mentioned, part of the problem is that we equivocate when talking about “types”. Instead of saying “type” or “type system”, I encourage anyone to check whether anything from the following alternative vocabulary will communicate the intended meaning. Deciding which term is appropriate is a good mental exercise for figuring out what meaning is actually intended.

If you can think of others, do let me know!

[/research] permanent link contact

Mon, 06 Oct 2014

Project extra

I just thought of another nice project idea, so here it is.

A generic fuzz-tester, using DWARF

Fuzz-testing is a technique for randomised software testing. The software under test is run with randomly modified inputs, starting from existing test inputs. This covers more code paths, hence potentially finding more bugs, than using only human-crafted test inputs.

Typically, fuzzers are built around particular input domains. For example, we might write one fuzzer for C compilers which generates randomised C source files. We might write another fuzzer for X11 applications which randomly modifies the packets sent over the wire. (In fact the original fuzzer, xjig, did exactly this.) We might write yet another fuzzer for a particular library API, like John Regehr describes here. This works... but can we build a more powerful, more general fuzzing system than these per-domain solutions?

This project is about building a general tool for the latter scenario: fuzzing of library APIs. We want to be able to fuzz any library whose API we have a description of. For us, this will mean described by compiler-generated DWARF debugging information. There are several technical steps to this. Firstly, we need the ability to observe and (optionally) capture API traces, typically from running an existing test suite. This can be done using something reasonably off-the-shelf like ltrace, although we might want to make some modifications to suit our purposes. Secondly, we need to perturb these traces somehow to generate randomised versions. These can be randomised both in terms of the calls made and the arguments passed; the project would need to investigate various randomisation strategies. Thirdly, we then execute these randomised traces and attempt to detect any errors that occur—perhaps assertion failures reported by the program itself, memory errors reported by Memcheck (of Valgrind fame), or type errors reported by libcrunch.

For evaluation, we would like to measure how much we improve test coverage, relative to the existing test suites. We could perhaps also compare this improvement against the improvement obtained by API-specific fuzzers, like Regehr's, hoping to do almost as well (or perhaps even better!).

One problem with fuzzing is false positives. It might be that our randomised changes yield traces which exercise the API in way that aren't supposed to work. Normally we'd want only to generate traces in which the client stays in-spec, while perhaps leading the library itself to go out-of-spec (that's what a bug is!). (In security-flavoured fuzzing, the exposed attack surface is what's important, not the documented interface per se, but the same principle applies.) Such specifications are rarely written down precisely! So, extension-wise, an obvious direction is to refine our model of APIs to allow for more semantic constraints. To pick a trivial example, if we were testing the C library's malloc implementation, one constraint is that we shouldn't double-free a chunk. That's a little subtle—it's okay to free the same chunk of memory a second time, iff it was issued by malloc a second time! There is a lot of scope for investigating this kind of constraint, and, in general, to produce more sophisticated semantic models of APIs. There is a wealth of research about “API usage patterns”, but often using only a naive finite-state formalism that struggles to capture whole-trace properties (like the malloc one I just described). We could also investigate using Daikon, or some similar invariant generator, for inferring such invariants from collections of harvested traces.

[/research] permanent link contact

Mon, 29 Sep 2014

Progress by distillation

A theme of Joe Armstrong's Strange Loop keynote was that creating stuff doesn't necessarily make progress overall. Proliferation reduces value. Entropy is our enemy. He mentioned the bewildering array of build systems, packaging systems, unidentifiable system software (I never did catch what “grunt” is) and, more generally, copies of the same or similar code and data. Building a “compressor” was his likeably vague solution.

My talk arguably continued this theme. I was talking about proliferation of dynamic languages—in contrast to the original Smalltalk vision, which didn't envisage the host of broadly similar dynamic languages which now coexist alongside it.

An underrated way to make progress is by distillation. It's to recognise repetition, to integrate and unify existing systems where possible. We don't have to do so perfectly or completely. It doesn't have to be a Grand Unifying Design. We can work opportunistically. But we need both the right mindset and the right infrastructure. Tools for creating are not enough. Tools for integrating and distilling are necessary.

I seem to have written about this before. It's nice to see the same ideas cropping up in a keynote!

[/research] permanent link contact

Project ideas 2014--2015

I maintain a brief list of ideas for student projects which I'm always willing to supervise. Here I'll elaborate on some that I'm particularly keen on this year. As always: these are just a selection. I'm very interested in having students discuss their own ideas with me, or suggesting variations on what I have proposed. Please contact me if any of these ideas (or those on the linked page) is of interest. Most of these ideas could make Bachelor's or Master's projects, with appropriate tailoring which we would discuss.

Program slicing at debug time

Program slicing is a powerful program transformation. It seeks to create a minimal program that explains some strict subset of a program's state at a particular point in an execution, throwing away code that did not influence that subset. For example, suppose I'm debugging the following (very stupid) program.

y = 12;
z = y * foo();
x = y + 1;

I might see, when debugging, that after the third line, variable x has the value 42. I want to know how it got that value. A program slicer could show me a minimised version of my program that eliminated any statements that did not contribute to the value of x. Here, that's the middle statement (probably! that's if foo() did not change y somehow).

Although Mark Weiser's original paper emphasises the fact that programmers do program slicing in their heads when debugging, automated slicing still hasn't entered the usual repertoire of features that programmers find in their debuggers. This project will implement and evaluate a slicer that is designed to be usable from a debugger. The evaluation is likely to be in size comparison with an existing slicer—smaller slices are usually better. We can also measure the speed of the slicing algorithm—since to be usable interactively in a debugger, it shouldn't be too slow.

A simple slicer is not too much work. There are also many directions for extensions. Precise slicing is a Turing-complete problem, so all approaches are approximate. Better approximation through cleverer program analysis is the obvious avenue. Another is amorphous slicing, which explores semantics-preserving changes to the program's syntactic structure, in order to produce a still smaller program. A third is to use knowledge of the program state itself as an extra constraint on slicing—not just the position we want to slice at, but also the values of program variables. Slicing can be seen as a kind of backwards program execution, with a user-customisable degree of concretisation introduced. The “most amorphous”, “most concrete” slice is simply the relevant subset of the (branch-free) trace of program execution (if it's unique; else the union of all possible traces leading to that state). This might not be as useful as a slice preserving slightly more of the program's execution.

There is also an incidental problem that will need treating somehow. It is that most debugging infrastructure doesn't directly you reconstruct the original source code that you're debugging (at least in the cases of native Unix-style debugging and Java debugging). Instead it enumerates the source files that went into each compilation unit, together with the line number ranges of each, and provides a shallow mapping of program counter values onto these files and lines. In Java this is very nearly enough; for C, it's less good because the preprocessing step is not explicitly described. So, there might be some insight to be had about how source files ought to be represented within debugging information. One extension might implement such an improvement, by judicious hacking on a compiler (most likely LLVM) to modify and/or supplement the debugging information that it currently outputs.

A garbage collector using liballocs

My work on liballocs is developing a run-time library that dynamically tracks allocations and their data types across (potentially) a whole process. This means you can ask it what's on the end of a pointer, and it can usually tell you—even for pointers to stack or heap memory, interior pointers, and so on. It can be thought of as implementing a reflection API, as a platform for building dynamic program analyses, and as a platform for implementing dynamic languages (see my talk!).

This project is about implementing a mark-and-sweep garbage collector on top of liballocs. This can be thought of as a replacement for traditional conservative collectors like the Boehm collector. If liballocs's view of the process is completely accurate, and no pointers are computed in way that we didn't anticipate, then we obtain a precise collector. To deal with unusual address computations or tricky encodings of pointers, we might still need to build in some conservatism nevertheless.

In any case, the liballocs API allows a collector to be more “federated” in its design. Traditional collectors are owned by a single virtual machine (VM) which makes strong assumptions about how objects are laid out, what addresses are issued, what parts of the process can point into “their” heap (the “roots” of tracing), and so on. With liballocs, a collector can make a reasonable attempt at tracing native data structures directly, without knowing in advance how they are laid out—the library returns metadata describing the layouts. This allows tracking reachability over paths that stray outside a single region and then reach back in—for example, from the collected heap into the malloc() heap and back.

This “federated” design is potentially (i.e. in future work!) an enabler of reasonably seamless programming in a mix of languages, or when mixing manual with automatic storage management. For example, interfaces like JNI force the programmer to explicitly signal to the collector when native code takes a reference to a Java object. by creating a “global references” (which is managed manually). With liballocs, it can simply trace the native data structures too.

Although liballocs can usually tell you what data type is instantiated on the end of a pointer, there are a few exceptions to this (such as on-stack temporaries), and this isn't quite all the information we need. For collectors, it matters whether storage is initialised (hence meaningful) or not, and whether a value is live or dead. It also matters what funky address computations a program might do (all the way up to extreme cases like XORed pointers!). A large part of this project will mean finding some kind of solutions to these, as well as gathering an understanding of when they do and don't matter in practice. Most of these issues have “big hammer” solutions which work fairly well but are slow (like replacing malloc() with calloc() to avoid uninitialised heap data) and also more clever solutions (like skipping this if we can prove that an allocation site always initializes the whole heap block before the next safepoint). So there is plenty of scope to pick extensions to the core project.

Obvious comparisons for evaluation are the Boehm GC, and also perhaps the garbage collectors of the Go or D languages. We can compare in both speed and precision (or, inversely, conservatism). The Boehm GC is well known for having observable conservatism, keeping objects alive when they are not reachable, because of integers that alias pointers. We would expect to do better than Boehm for precision, while retaining comparable performance (within a factor of 2 for total memory management time overhead, say). An MPhil project would set more ambitious goals and use more advanced garbage collection algorithms (e.g. a generational and/or compacting collector).

A bounds checker in libcrunch

My work on libcrunch uses liballocs to implement run-time type checking for C (and other languages, like C++ and Fortran, in due course). The basic idea is that whenever a pointer cast occurs, we check that the cast-to type “matches” what's on the end of the pointer (for a sufficiently refined notion of match). One weakness of libcrunch is that it assumes the program is memory-safe, both spatially and temporally. Put differently, it will catch type errors caused by bad cast logic (analogous to a ClassCastException in Java) but not those occurring as a consequence of buffer overflows (“spatial” errors), nor use-after-free or use-before-initialize behaviours (“temporal” errors). This project will implement spatial correctness checking that integrates with libcrunch.

Unlike conventional spatial bounds checkers, like SoftBound or ASan, we have type information (from liballocs) so can do better in various ways. Most significantly, there is no need for per-pointer metadata; per-object type metadata is enough (this holds for C; ask me for the detailed reason why!). This means we don't need to move metadata around as we move pointers around, so we should usually see some performance wins. It also doesn't matter if we haven't instrumented all code; as long as we observe allocations, we will have the necessary metadata. However, there are some drawbacks: querying pointer metadata will involve a short search, via the containing object's metadata, rather than a direct-mapped lookup as in SoftBound. The main target of the project should therefore be to go roughly as fast as SoftBound. Lower memory overhead and lower virtual address space usage are also likely benefits of this approach; these can be measured.

System call interposition, specification and emulation

To understand the precise semantics of a user-level binary program running atop a commodity operating system, we must understand not only the instructions it executes but also the system calls it makes. Currently, we are making progress on formally specifying instruction set architectures, but the system call interface remains surprisingly murky. What system calls may a program make? What, precisely, are the valid arguments to each system call, and what do they mean? Given some arguments, what memory might a given system call touch? What other effects might it have on the program state?

This project will build a toolkit for observing, specifying, intercepting and emulating system calls as made by real, large programs. As a starting point, we have a basic (rather limited) preexisting infrastructure for trapping system calls in an x86-64 Linux program, and a sketch of a domain-specific language (not set in stone) for describing a system call interface. The project will produce a usable end-to-end system for intercepting, specifying and emulating system calls. In a basic usage, we would the calls back to the underlying operating system and observe it (e.g. using SystemTap or DTrace) to check the accuracy of our model. In a more advanced usage we would instead be running the program in an emulator, and would update the emulator's state to reflect the effect of the system call. The intention is to have this work for various pairings of OS kernel (Linux, FreeBSD) and instruction set architecture (Intel, Power, MIPS, ARM), although at least initially, some pairings are more important than others.

This project is very researchy, and has both practical and theoretical aspects. It's motivated by the research we're doing in the REMS project.

A linker, mostly functionally

Much like compilers, linkers have a crucial influence on the meaning of the eventual binary program. Current linkers are big piles of imperative code, with little explicit specification. This project concerns building either a static or dynamic linker for ELF binaries in a mostly functional style, in a way that will yield a clear specification of the various parts (relocation, address assignment, section concatenation or merging, etc.). Ideally the linker would be written as an “executable specification”, likely in the Lem language, making explicit the points where nondeterministic choice is taken within the written specification (as far as it exists).

Either a static or dynamic linker can be attempted; in the dynamic case, we have a very basic skeleton already. It won't be feasible to implement a fully-featured linker, but we will carve out some subset depending on interests.

An optimising linker

Traditionally, linkers do not understand the instructions in a program. Instead, they blindly patch binary code in predefined ways, such as “add n bytes to the four-byte signed integer at this address” (which happens to be the relative-address field inside a call instruction, say).

However, a linker is potentially a very good place to do optimisation, because interprocedural flows are necessarily visible to it. A dynamic linker is potentially a very good place to do dynamic optimisation, because it can observe code being loaded and unloaded, hence can perform optimisations speculatively yet safely. So, there is a case that linkers should understand instruction streams.

Current toolchain support for link-time optimisation (LTO) is limited, in both gcc and in llvm, by working on the toolchains' intermediate representation. This must somehow be propagated into the input binaries—the binaries must be “compiled for” link-time optimising. An alternative approach is to bite the bullet and teach the linker about instruction streams, so that it can disassemble and re-optimise the instructions directly, likely using debugging information as a source of type information where this is helpful.

Some other interesting applications of link-time instruction stream rewriting include whole-program instrumentation (e.g. to intercept certain procedure calls, system calls, or certain memory accesses, such as a generational garbage collector's write barrier), reference widening (to overcome the complexity of code models) and speculative dynamic optimisation (e.g. to do “devirtualisation” of indirect call instructions). One of these could perhaps be addressed as an extension.

The chief evaluation will be on performance improvements. There are also benefits in terms of binary size, relative to traditional LTO toolchains, which can be measured too.

[Update: I added another project suggestion in a separate post.]

[/research] permanent link contact

Wed, 02 Jul 2014

Why isn't verification standard practice?

Yesterday, Steve Crocker gave a very stimulating talk sharing thoughts—and also soliciting thoughts—about why verification isn't standard practice. He began with an anecdote about last year's BIND vulnerability (CVE-2013-2266) that could allow a single malicious packet to crash a DNS server. The problem is a simple bug in a regular expression library. Malicious clients can exploit the bug to make the process allocate huge amounts of memory, and hence kill it. How can we justify the fact that such simple bugs get into deployed code? Why can't we verify the absence of at least these simple bugs in anything that we deploy?

There were many useful contributions from the audience. It took me a while to collect my thoughts, but here are some personal responses that weren't discussed at the time.

Coding versus deployment

The BIND problem is primarily one of deploying shoddy software, and only secondarily one of programming shoddy software. Even with the best possible tools, making something robust will always take time and effort. Our problem is that we don't have good ways by which to make an informed decision about whether something is good enough (and if not, what needs fixing). This is re-stating a theme of the discussion: that our problem is at best only partly a technical one. I do believe that technical solutions and cultural solutions must go hand-in-hand: changing the technology can change the culture.

Proof as an activity much like programming

Considerable progress has been made in proof assistants like Coq and Isabelle, where proof starts to look like a programming task. In particular, the machine helps us do it and helps us check it. This is a really useful advance. Programmers are already used to satisfying a proof procedure, if they use any language with a type checker. But that doesn't mean we need to restrict ourselves to designing all our proof systems to be type checkers or things like them! I'll elaborate on this shortly.

One size doesn't fit all

I believe it's too limiting to expect a single compile-time analysis to tackle all proving and proof-checking workloads. If we write some immature code that's doing something subtle, we might expect machine proof (or disproof) to take some time. As it gets more mature, we can add more annotations and generally structure the code better, to help make the proof fast. We can't just forbid ourselves from writing or executing immature code. Of course if we consider the BIND scenario, meaning the case of mature, production software, then the deployed code should be sufficiently mature that a fast compile-time analysis is capable of producing and/or checking the proof. But we need tools that let us progress code across the maturity spectrum, not just demand a fixed level of maturity.

Language-independent analyses

One of the reasons that we get hung up on language so easily is that static analysis systems like to have a source-level input language. There's no reason why they need to, though. As programming researchers, I'd argue we have never been very good at recognising the need to accommodate diversity of programming languages in most of the infrastructure we design. This can and should change. (It's something I'm working on, intermittently.) One approach I think makes sense is “binary-level analysis, source-level properties”. We can annotate source code and push those annotations through into binaries, and check that they hold of the binary level. Binaries are what is deployed, after all, and as I've argued already, deployment is the point where the assurances are most needed. It also defuses complaints about the ambiguity of source languages. Binaries are a lot less ambiguous (and we're working on improving that too). While source-level correctness for all deployment environments, i.e. “correctness portability”, is a nice property, it's a harder problem and not always important. We should walk before we run.

Simple and not-so-simple bugs

Is it enough to consider only the simple bugs? During the talk, Jon Crowcroft rightly put it like this (I'm severely paraphrasing): rewriting everything in a better language would bring us to the next level of problems, and these would be more interesting than most of the ones we currently face, but (by implication) such problems will exist. It's not clear to me that a similarly nightmarish scenario could not occur in BIND owing to some much more subtle bug than a buffer overflow (or similar). If what I just said happens not to be true for BIND (again, following Jon's comments that DNS and other core internet services are simple), it's probably true for some other critical system (like the web).

The static-to-dynamic continuum

Some discussion of assertions during the talk was interesting. One audience member commented that some software crashes more than it would if it didn't contain assertions, either because the assertions are wrong or are not sufficiently critical properties to justify termination of the program. I find it hard to blame the assertions for this, It's true that if we must tolerate a risk that assertions might fail in production code, “log and [attempt to] continue” is a marginally better strategy than “abort”. But Steve Crocker countered that for critical, deployed software, assertions should be proved to hold—a position I agree with. That's a moral “should”; in practice we have to be very clever about how we enforce this. For example, we wouldn't want to unwittingly encourage programmers to delete assertions that they couldn't prove to hold. More importantly, to repeat my last paragraph, we need to allow developers to progress a codebase from immature to mature states. We might only deploy it when it's mature, but for development, we need to have something which admits run-time failures, yet is still executable. This is another reason why I believe a fixed compile-time analysis isn't the right approach, and that we should instead choose the level of assurance we want at deployment time. The tool that establishes (or denies) this assurance might also be a compiler, but needn't be.

Assertion languages, not annotation languages

Even though we don't currently prove assertions to hold, I'd argue that assertions are a great success because they elicit specification from programmers. Most programmers use them, and they elicit a lot of subtle properties which might otherwise not get explicitly written down. Improving the expressiveness of assertions is an approach that I'm particularly keen on. The TESLA work of Jon Anderson, Robert Watson and colleagues, on temporal assertions, is a great example. I have some work brewing about expressing type correctness using assertions.

The neat thing about assertions is that they are expressed in the programming language, and easily understood by any programmer. I believe that building specification and verification infrastructure out of programmer-friendly constructs, like assertions, is a necessary step to making them standard practice. We should avoid forcing programmers to use formalisms that don't map very clearly to the program. Again, TESLA is an example of this. Reportedly, programmers find it easy to write TESLA automata. They might not find the same thing easy to write in LTL or CTL*, even though those formalisms might be more friendly to somebody who writes verification algorithms. So, when I hear mention of “annotation languages” as a separate thing that needs to be bolted on to a source language, my reaction is to ask whether these annotations can be expressed in the source language itself, augmented by only a minimal and programmer-understandable set of new primitives. In my book, such a property then becomes an assertion rather than an annotation.

The performance culture

This is a bit of a departure but I can't help throwing it in. It's generally considered that infrastructure that slows programs down more than a few percent is not acceptable for deployment use. This is in spite of the relative plenty of computing power and the relatively huge expense of simple bugs like buffer overflows. Suppose we reach a point where all the “basic” bugs are routinely proved absent via verification. There might be some less basic properties which we can only check dynamically at nontrivial cost. Do we leave the checks in, or not? I'd argue that the performance nuts are usually getting the calculations wrong. The cost of extra servers pales to the cost of downtime, security compromises and the like. and they also pale next to the cost of debugging subtle failures that are not caught cleanly. Unfortunately, saying that the slowdown is too great is still a popular way to reject research papers. The commonly accepted criteria about “suitability for deployment” are finger-in-the-air stuff.

The I/O bottleneck

This was the least obvious of my thoughts, but is for me one of the most important. Let's consider the Heartbleed bug, which is triggered when a malicious client supplies a bogus length field and causes the server to overrun its buffer. This particular overrun would have been prevented by a memory-safe language, because the buffer copy would be bounds-checked. But why do people write code at the level of bytes and buffers anyway? Despite what some people write, “type safety” is not quite the issue here. Bytes and buffers are fundamentally semantically impoverished things, and yet we find ourselves writing this kind of code in every language, no matter how “type safe”, and no matter how expressive its data abstraction features might be. Coding at the level of bytes and buffers is always error-prone, even if one or other kind of error is ruled out by one or other kind of check mandated by a given language.

The reason we do it is because of I/O. I/O happens on bytes and buffers, because that's what the libraries expose. In turn, they blame it on the operating system: that's what Unix exposes. Once upon a time there was a lot of work on persistent programming languages, but they have not taken off. I suspect it's because they require too much buy-in. We don't want our data to be an opaque box that is managed only by a language runtime. What we need instead is to add specification to the byte-streams we already have. If we produce or consume byte-streams, we should specify their format, both syntactically and including some basic semantic properties. If we're OpenSSL, we'd specify that the first two bytes of the heartbeat message are a length field, and that it should be equal to the remaining length of the buffer. This can already be expressed in languages like Datascript. We might be able to come up with something even better (ask me!) but in any case, we also now need to push this kind of specification facility deeper into our libraries, language runtimes and operating systems. Once we've done so, it's trivial to insert bounds checks. We can even check subobject bounds inside the payload, i.e. referential integrity within the payload itself. In general, once we know the meaning of the bytes, we can actually treat I/O payload data abstractly. By contrast,even a “type safe” language currently only achieves abstraction by copying the data into data structures that it manages. Doing so unavoidably requires interpreting those bytes, and doing so unsafely—any checks that are done are entirely down to the programmer.

It's no coincidence that I've advocated this before. I'm working on it—after I work on all the other things that come before it in my list. If somebody wants to join me, that would be great....

[/research] permanent link contact

Tue, 01 Jul 2014

Drop-in debugging

If you're debugging a problem that occurs in some process deep in a process tree, getting a debugger onto that process can be hard. I recently wrote a script called gdbrun that can help with this. The idea is that for any command cmd args, you can do gdbrun cmd args and it will run the program under a debugger, while externally behaving just like the original program. The exception is if the program doesn't terminate properly: in that case the debugger sticks around for you to interact with. (If you need a more complex condition than “doesn't terminate properly”, you can code that into the script too.)

This isn't ideal—say, if you only want to debug only one invocation of that binary among many. But it works in many cases. (I'm also working on a solution for the harder case, but one thing at a time.) The script is deceptively simple. Sadly it needs GNU bash, but for any non-GNU people it should be easy enough to port.


exec 7<&0
exec 8>&1
exec 9>&2

quote () {
    sed "s/\([\`\\\\\\\"\\\$]\)/\\\\\\1/g"

while true; do
    if [[ -n "$1" ]]; then
        # don't use 'echo' to generate input for quote, otherwise "-e" will be disappear'd
        argstring="${argstring:+${argstring} }\"$( cat <<<"$1" | quote )\""
    ctr=$(( $ctr + 1 ))
    shift || break


# can add -hold here to help debugging
xterm -e gdb \
    --eval-command "file $exe" \
    --eval-command "run $argstring </dev/fd/7 >/dev/fd/8 2>/dev/fd/9" \
    --eval-command "set logging file ${exit_status_file}" \
    --eval-command "set logging on" \
    --eval-command "print \$_exitcode" \
    --eval-command "quit \$_exitcode" </dev/null

inferior_status=$(cat ${exit_status_file} | sed 's/.* = //' )
exit ${inferior_status:-0}

What was hard? Firstly, it took some head-bashing to realise that sharing a terminal between the debugger and inferior wasn't going to work. That's why the script uses xterm—popping up terminal windows for an instant is a minor annoyance, but tolerable. I can't say for certain that a shared-terminal solution isn't possible, but the usual trick of stty -tostop wasn't enough to stop child processes hanging from contention for the terminal, so I gave up.

The second hard thing, and the hardest, was realising that I could thread duplicated file descriptors right down to the inferior, and open them with /dev/fd/7 and friends. This was non-obvious to poor old me. I first tried a solution with named pipes and cat processes forwarding data from/to the gdbrun script's streams to the inferior's streams. After much gnashing of teeth, this turned out to be unworkable. The reason is that pipes are asynchronous channels, meaning they're buffered. A consequence is that even if the inferior doesn't read from its input, some input data will be buffered in the pipe when it gets opened (actually it's opened by the shell that launches the inferior, using gdb's run <infile syntax), and then discarded when the inferior dies. So if you have a command like echo blah |(gdbrun true; cat) your blah will get gobbled into oblivion. The moral of the tale is that introducing cats and pipes is a recipe for mischief.

[/devel] permanent link contact

Fri, 13 Jun 2014

Linking and loading: what's incidental?

Not long ago I gave a couple of talks about linking, loading and debugging here in Cambridge. It's a notoriously grungy area of our infrastructure, and a recurring question people asked me after the talks was as follows: what's the incidental complexity and what's the intrinsic complexity? Put differently: how might we redesign the whole lot nowadays, to get a less complex end result?

It's hard to answer, and partly it depends on what you value. On the surface, it seems like linking should be conceptually simple. Digging deeper though, it becomes apparent that there are a lot of concerns involved, and they are often in tension, making it very tricky for any simple solution to satisfy all requirements. That doesn't mean that simpler solutions are not possible, but we should not underestimate the intrinic complexity either.

What intrinsic complexity am I talking about? I can see four concerns central to the design of ELF dynamic linking. One is link speed: linking should be as fast as possible. Another is memory sharing: shared libraries were introduced (in part) to save memory. Another is flexibility, achieved through interposability: it should be possible to customise the link to add or replace features without modifying binaries. The last is federation: ELF is shared by many vendors' toolchains and many OSes, so the ELF runtime necessarily has a descriptive role, which is the basis for cross-toolchain runtime protocols for stack walking, exception handling and the like.

Part of the problem is that many people don't value all four of these concerns simultaneously. Link speed is valued by software distributors (who want to cut down startup times) and people who use embedded platforms (who can't afford overheads), but not workaday developers or, directly, users. Position-independence is valued by people who run the kind of systems where it saves memory, like desktop systems, and not by those who don't, like most servers. (Plan 9's decision not to bother with shared libraries is both reasonable, if you run mostly servers, and unreasonable, if you run mostly large desktop applications.) Interposability is valued by people who write tools and those with unusual requirements—which often entail reimplementing selected parts of core libraries (think malloc())—but not by those who don't. Federation is valued by those who value “openness”—competition among vendors, interoperation of tools, multi-language programming, and so on—but not by those who are comfortable with closed platforms, VM-style designs and/or “one true language”.

Few people care about all things simultaneously, so linking seems overcomplicated to everyone. But all these concerns add to complexity individually, and the tension between them also increases complexity. Link speed concerns create incentives for doing more work in the compiler or compile-time linker. This creates “premature optimisation” phenomena which complicate linker implementations (by adding to the variety of relocations) and also complicate the user-facing parts of the linker (such as getting the link options right). Sharing memory entails position independence; this creates complexity for the compiler author, who must support multiple code models, and the user who must grok the related compiler options. Interposition creates complexity for the linker author and user, and costs in performance since it prevents some binding happening earlier than it otherwise could. Federation adds complexity for the compiler author, who must take care to correctly follow conventions and emit metadata such as unwind information and debugging information. It also brings some cost at run time, such as how a cross-language cross-vendor exception protocol is slower than a custom one.

What design decisions can we rethink to get a system that satisfies all these requirements but with less incidental complexity? I believe that a smarter linker could help. The Unix-like linker design, as evolved to elaborate formats such as ELF, can be described as “smart format, dumb linker”. A key “dumbness” of the linker is that it does not know about instruction streams. Rather, the linking format describes all the ways that relocations can be applied as simple arithmetic on byte sequences. The linker's output is a mostly-deterministic function of its inputs (including linker scripts).

If we had a smart linker, we could have slightly dumber compilers—a win for simplicity since there are many more compiler implementations than linker implementations. We could also have less user-facing complexity: the compiler could just output code according to the slowest but most flexible code model, and the linker could take care of optimisations at the instruction level, including position independence. The key difference between such a linker and the current one is that a linker that can do these things must be able to parse and rewrite instruction streams. Current designs have been crafted to avoid this, but I believe this is a good fit for linking and not the burden that it first appears to be.

A smarter linker needn't yield slower programs than a dumb one. In fact it could speed them up. Link-time optimisation is rarely used at present, and not at all across dynamic linking. Meanwhile, interposition has nontrivial run-time costs in some cases, since symbol bindings must be computed at load time and run time. A smarter linker could use its awareness of instruction streams to deploy a variety of strategies to speculatively bind, cache and unbind references, both in disk images (which can be thought of as a less ad-hoc prelink) and in memory (e.g. inlining calls that currently go through the PLT). The former option can be thought of as link-time deoptimisation: we store on disk a binary that has been adaptively optimised for the common case on the particular system, and then if we find that an unusual link requirement comes along, the linker can produce a more flexible deoptimised version. The idea of “linkers as servers” has been explored before (e.g. in the OMOS linker, from Utah) but not in this particular way.

Compatibility concerns have also greatly complicated linking. Before shared libraries, things were fairly simple. To link against the C library, you link with -lc. This is just a bunch of object code in an archive. Contrast that with now. Shared libraries have been added in a “compatible” way that sits alongside, but doesn't replace, archives. As a result, we use both, and get double the complexity. For example, on GNU platforms we mostly use shared libraries, but libc_nonshared and -lgcc et al complicate all that. Since “compatible” shared libraries could not quite offer a complete solution for replacing the old-style linking, we have bifurcated the previous clean static linking model into a mostly-shared-but-partly-static model. A smarter linker could internalise the goal of code sharing. By rewriting instruction streams at load time, any compiled code is potentially shareable, and it's the linker's job to extract an adequate degree of sharing. There's no need for the user-facing complexity of distinguishing shared from non-shared linking.

Shared libraries themselves must bring some added complexity, since being (logically) shareable is a stronger requirement on a chunk of code than having it work only in a single context. In particular, robust shared libraries require symbol versioning or some other notion of “semantics”, to account for the inevitable evolution of software. Current designs for symbol versioning are themselves a great example of unnecessary complexity. Again, they were designed to be a transparent extension to what already existed—versionless shared libraries. This has yielded an overly circuitous solution. By pushing symbol versions into the symbol name, a lot of complexity is pushed to other contexts. Most irritatingly, cients of dlsym() are essentially broken by it, because they must deal with symbol names munged to include versions (or must switch to using dlvsym()). The “exact match” semantics offered by dlsym() no longer offer the right abstraction when looking up symbols provided by versioned libraries. But this solution was accepted because it didn't syntactically break anything (and semantics are the user's problem, right?). The ideal solution would look up symbols not (just) by name but also (partly) by their semantics.

Interposition is also arguably not a fully realised idea. One problem is that not every symbol is interposable, because of the static/dynamic split I mentioned earlier: some programs, and some parts of most programs, are still linked statically. Moreover, “interposition misses” easily occur where the interposer doesn't interpose on all the def–use edges it's supposed to, For example, in glibc, many library-internal calls to low-level functions like mmap() don't go via the publicly-exposed symbol. Calls that appear to be to open() might actually be linked to a mysterious alias of that symbol, such as __open_2(). Although interposition is still useful, the possibility of “missed” calls and the need to grok undocumented binary interfaces like __open_2() make interposition-based software fragile and expensive to maintain. This could be fixed by a more uniform dynamic linking model and a more semantically-founded notion of interposition. One that might work would be interposition based on code paths, not entry points. Again, implementing such a notion inside a linker would require that linker to understand instruction streams and the control flows they allow. When we ask to interpose on a (function) symbol, what we're really asking is a bit more like to insert a new call-graph node that dominates all those in the interposed-on call graph. (It's a bit more complicated than that, since this alone doesn't solve the glibc problem I mentioned, but this basic idea can likely be extended to yield the interposition semantics we really want.)

Hardware complexity also leaks into linkers. The root cause of complexity in relocation and code models is the diversity of addressing modes offered by hardware. Linker authors and others working at the toolchain level tend to take on trust the good sense of the hardware designers' decisions. Whether this complexity is intrinsic or incidental depends on whether we allow ourselves to rethink the hardware too. For the moment I'm going to take this diversity as a hard constraint, but I'd welcome views from more hardware-minded people.

All this is not to say that old-fashioned complexity creep isn't behind its share of problems. Debugging information is a great example of “limb growth” and is crying out for a cleaner design, ideally one that helps reduce the burden on compiler authors and/or improve the accuracy of the debugging information they generate. For linking proper, though, the old-fashioned Unix model of “dumb linking” has stayed relatively clean. It's been keeping that model, in the face of new requirements such as shared libraries, versioning and so on, that has complicated things. I believe it's no longer the right model. Luckily, we can produce a smarter linker that is nevertheless a drop-in replacement for the standard one, so we have an evolutionary path towards something better.

[/research] permanent link contact

Wed, 14 May 2014

Instrumenting casts in C++

A reviewer of my libcrunch paper wanted to see me back up my claim that its architecture is “multi-language” by demonstrating support for another language. In particular, for some strange reason, he wanted C++ support.

I had shyed away from C++ support because C++ front-ends are infamously complex. I don't know of a particularly friendly system for instrumenting C++ code. But on reflection, perhaps this isn't necessary. One of the joys of C++ (yes, you heard me) is that it's designed so that user-defined constructs are more-or-less on par with built-in constructs. My instrumentation needs to instrument pointer casts which would otherwise be unchecked. If you follow good C++ style, these are done using static_cast and reinterpret_cast. (Note that dynamic_cast is already checked by the runtime.)

Can we define our own checked_cast template that is a drop-in replacement for these two built-in operators? Then all we need is to #define the latter to the former and we've done the instrumentation.

I've managed to get something that appears to work well enough, but this is easier said than done. We might be tempted to write a function template like the following.

template <class From, class To>
To checked_cast(const From& f)
    // ... do our check, then...
    return static_cast<To>(f);  // or reinterpret

But that's no good, because we need to distinguish the case where To is a pointer from where it's not. We can't partially specialise function templates, so we can't do this. It's also no good because some instances of static_cast need to be constexpr. Indeed, static_cast to and from integers and chars is used extensively in header files, in constexpr contexts such as expressions which size arrays, or the bodies of inline library functions that are themselves constexpr. We need a more fine-grained approach.

Let's try a class template. This is awkward because casts have to look like function invocations. More specifically, we're instrumenting expressions of the form cast_operator<T>(expr) and we need something that drops in where the cast_operator bit goes—we can't modify the T or expr parts, or add anything afterwards. What seems to work is defining a class template such that the cast expression becomes a constructor invocation of a class we provide, say checked_cast_t<T>. We then define a type conversion operator to turn it back into something of the result type. The compiler inserts the conversion implicitly, so we have a drop-in replacement.

template <typename To>
struct checked_static_cast_t
    To temp; 
    operator To() const { return std::move(temp); }
    // construct a temporary, which decays 
    template <class From>
    checked_static_cast_t(const From& arg) : temp(static_cast<To>(arg)) {}
    // overloaded constructors go here

// ... specialisations go here

#define static_cast checked_static_cast_t

We need to specialise the above to make various cases constexpr. In particular, we want all casts to integral types to be constexpr, and all casts from integral types to be constexpr if they're not going to a pointer type. We handle the former by specialising the whole template, instantiating To for the various built-in integral types. We handle the latter by overloading constructors to be constexpr (which, perhaps surprisingly, does work). We also need constructors that take rvalue references. All that expands out to a lot of tedious code, but the result happily compiles various standard library headers that make use of static_cast and reinterpret_cast. We can then supply the header to our compiler using a command-line -include option, and hey presto: we've instrumented casts in C++ using just the C++ language itself.

It's not perfect, though. We can't instrument C-style casts. It's also certainly possible to construct code that compiles without the instrumentation but not with it, by requiring pointer casts whose results are constexpr (and which, by definition, we refuse to provide). In particular, it'd be nice if constexpr were a modifier that can be a basis for overloading—then we could ensure that a cast for constexpr input always yield a constexpr results, by overloading this case explicitly. As it is, I'm hoping it's good enough; I'll post back later when I have more experience of using it.

[/research] permanent link contact

Tue, 08 Apr 2014

Dynamic linking and security

The thoughts in this post were provoked after reading Tim Brown's very interesting Breaking the Links article.

Dynamic linkers are notable for privilege escalation bugs. The reason is their interaction with the setuid mechanism, and indeed any mechanism that associates privileges with an executable. Unix's original model where executables are trusted in their entirety is fundamentally flawed on modern platforms that have shared libraries, where executables usually link in other code, some of which can be supplied by the user. Rather than getting rid of the now-flawed setuid mechanism, currently dynamic linkers instead impose a raft of ad-hoc restrictions, lashed together in the hope of closing off any paths by which user-supplied code can get into a setuid process. They must also balance this against another goal: to avoid the unwanted side-effect of ruling out some perfectly trustworthy compositions. Unfortunately, these ad-hoc measures invariably fail on both counts.

What does setuid mean? It means that the invoking user has access to any behaviour allowed by the setuid program, as executing with the program owner's effective uid. Attackers escalate their privileges by introducing unanticipated code which widens that set of behaviours. Can we take a better approach? One naive idea would be to construct the process as normal, and then check that it includes only trusted code; at that point, we decide whether it runs with elevated privileges or not. (A wart is that we also have to account for code loading after the start of execution; let's ignore that for now.)

Does this break anything? Certainly it will do. I might run a job that spawns a process tree in which some subprocess is setuid. Normally I can run the tree with some other library LD_PRELOADed, expecting that although my library won't get preloaded into the setuid process, that process will still run with elevated privileges. Under our proposed new model, if we do the preloading then discover that the preloaded library is not trustworthy, we will run it with lower privileges, and likely break the process tree (assuming the process really needed to be setuid).


This is a feature interaction, and what we need is a policy for resolving the interaction. Current Unices have the policy that “setuid clobbers LD_PRELOAD”. The alternative we just considered is that “LD_PRELOAD clobbers setuid”. Neither of these seems adequate. Perhaps instead we can evolve things towards a more subtle mechanism that can avoid the interaction in the first place, perhaps by selecting among untrusted and trusted libraries. For example, if there are multiple available versions of some library, we might use the more trustworthy one instead of the one that a fixed set of name lookup rules guides us towards. In general, we can see this as resolving ambiguity among a set of depended-on library specifications in a way that maximises value (both trust and functionality).

Doing so requires a way to designate what code is trusted, not just what executables are trusted. We also need a sense of what alternative ways there are of satisfying “the same” link requirement. I have been using the example of LD_PRELOAD so far, but on ELF platforms, link requirements (beyond the executable) are specified as either a PRELOAD or (more often) a NEEDED, a.k.a the DT_NEEDED header of ELF's .dynamic section.

To find alternative implementations of “the same” requirement, we can mine the redundancy inherent in RUNPATH, LD_LIBRARY_PATH and perhaps the multiple notions of ORIGIN that can created by hard-linking. Each of these might provide multiple candidate libraries. Setting up a fake ORIGIN is a trick familiar to crackers, but we can turn it around by enumerating all possible ORIGINs of a given shared object and considering all the libraries we find there. (Sadly this requires a scan over all directories in the filesystem, and in the common case will yield only one library. But this approach will defeat link-based attacks, since even after hard-linking, we will still find the original library, and any sensible trust heuristic will select it in preference.) The ABI tag matching (modified by LD_ASSUME_KERNEL) is another instance of how the linker will look for libraries in particular places satisfying particular properties, in a way that is currently very rigid but could be generalised into a search/optimisation problem where paths supplied by developers, administrators and users are used as hints and bootstrapping input, rather than definitive instructions.

This approach brings two further problems. Firstly, what's to prevent us from choosing a probable-looking binary that is semantically broken (with respect to our use of it)? We can argue that all binaries with the same soname should be interchangeable, but in practice there will be difficulties. And matching by soname might be too restrictive anyway. Secondly, like any search- or rule-based system, our approach has a “delocalising” effect, lessening the administrator's explicit control and making the linker's behaviour more complex to configure and debug.

Another subtlety is that trust in the part is not the same as trust in the whole. Even if we refine Unix's notion of trustedness down to libraries rather than just executables, some exploits can work by combining trusted code in untrusted ways. The case of another linker exploit, CVE-2010-3856, is one instance of this: the library is sane enough that it could easily be deemed trusted, but we can construct a very specific context in which it is anything but. (This context is: use it as a linker-auditing library to a setuid binary, causing its constructor to be run with elevated EUID, hence allowing a temporary file exploit that would not emerge in “normal” contexts where the constructor did not have elevated privileges.) This is a classic “confused deputy” situation.

Confused deputies are always a good argument for yet finer-grained models of privilege, such as capabilities. So it's not clear whether we would get much security value from search-based link-time composition, relative to plumbing a better model more deeply into our operating system.

[/research] permanent link contact

Mon, 13 Jan 2014

C libraries and linking

At my talk today, Simon PJ asked an interesting question which I managed to give a slightly wrong answer to. I had observed that asking my C compiler to link an object file invoked the linker with a lot of extra input files, many of which are specific to the particular C library implementation being linked to. Note the various crt*.o files in the following link command concocted by gcc. These files come from the GNU C library.

$ gcc -### -o hello hello.o 
/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/collect2 \
  --eh-frame-hdr \
  -m elf_x86_64 \
  -dynamic-linker /lib64/ld-linux-x86-64.so.2 \
  -o hello \
  /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o \
  /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/crtbegin.o \
  -L/usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.8.0 \
  -L/usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/../../../x86_64-linux-gnu \
  -L/usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/../../../../lib64 -L/lib/x86_64-linux-gnu \
  -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu \
  -L/usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/../../.. \
  hello.o \
  -lgcc \
  --as-needed -lgcc_s --no-as-needed \
  -lc \
  -lgcc \
  --as-needed -lgcc_s --no-as-needed \
  /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/crtend.o \

What does this mean if I've compiled some of my program with compiler A (from some vendor whose C library is in /usr/A/libc.a, say) and some with compiler B (from another vendor whose C library is in /usr/B/libc.a)?

It's tempting to say that C compilers are strongly coupled to their library, so we must link via some unique C compiler and use only its library. Does this preclude using another C compiler for some of our program? I answered more-or-less in the affirmative... but it's not true! There are two clear (in hindsight) bits of evidence to the contrary.

The first is that empirically, it's easy to see the same C library being used by multiple compilers. The important thing is that there's only one set of library headers. When I install clang on my Linux box, it happily uses the incumbent glibc library headers when compiling. When linking, it happily issues the right linker command to link with the glibc binaries. Indeed, it issues a very similar linker command to the one we saw earlier. We can again see the glibc-provided crt*.o objects being linked in.

$ clang -### -o hello hello.o
Ubuntu clang version 3.2-1~exp9ubuntu1 (tags/RELEASE_32/final) (based on LLVM 3.2)
Target: x86_64-pc-linux-gnu
Thread model: posix
 "/usr/bin/ld" "-z" "relro" "--hash-style=gnu" "--build-id" "--eh-frame-hdr" \
 "-m" "elf_x86_64" "-dynamic-linker" "/lib64/ld-linux-x86-64.so.2" \
 "-o" "hello" \
 "/usr/bin/../lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/crt1.o" \
 "/usr/bin/../lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/crti.o" \
 "/usr/bin/../lib/gcc/x86_64-linux-gnu/4.7/crtbegin.o" \
 "-L/usr/bin/../lib/gcc/x86_64-linux-gnu/4.7" \
 "-L/usr/bin/../lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu" \
 "-L/lib/x86_64-linux-gnu" \
 "-L/lib/../lib64" "-L/usr/lib/x86_64-linux-gnu" \
 "-L/usr/bin/../lib/gcc/x86_64-linux-gnu/4.7/../../.." "-L/lib" "-L/usr/lib" \
 "hello.o" "-lgcc" "--as-needed" "-lgcc_s" "--no-as-needed" "-lc" "-lgcc" \
 "--as-needed" "-lgcc_s" "--no-as-needed" \
 "/usr/bin/../lib/gcc/x86_64-linux-gnu/4.7/crtend.o" \

But how does it know about these files? The answer is worse than I had imagined. The file lib/Driver/ToolChains.cpp in clang's codebase embodies a ton of knowledge about linking on different platforms—even down to individual GNU/Linux distributions and versions thereof. Unsurprisingly, this is fragile and has been known to spawn bugs, like this one.

The second bit of evidence is in how most C compilers let you tell them to use a foreign set of headers, which could be from any C library implementation we like. To avoid the “standard” headers in /usr/include you need to use an option like -nostdinc, and then use -I to point it at the right headers. Assuming the headers don't use any non-standard features, there's no reason why any compiler couldn't generate code using any other vendor's library headers.

Of course, “there's no reason why not” often translates to “it is unfortunately impossible”. A data point in this case is provided by uClibc, a small replacement C library, whose FAQ notes that “it is possible in some limited cases to re-use an existing glibc toolchain and subvert it into building uClibc binaries by using gcc commands such as -nostdlib and -nostdinc... [but] it proved impossible to completely subvert an existing toolchain in many cases.” I'd love to dig into the details of what didn't work, but that will have to wait for another day. (It might just amount to the same issue we've noted, i.e., making the C compiler generate the right link commands... but it wouldn't surprise me if there's more to it.)

Another case I haven't handled: what about multiple C libraries (or parts thereof) in the same process? An obvious problem is conflicts in the symbol namespace—since the ABI likely requires a unique definition of some symbols, forcing a choice between the two libraries. (Chapter 6 of the System V AMD64 psABI is in a vague state, but appears to confirm the existence of this kind of constraint.) However, with enough hammering on symbol renaming and scope-localisation, there's no reason why certain portions of two different C libraries couldn't coexist. But it seems unlikely that two implementations of any low-level parts (such as startup and shutdown, threading and signal handling) could be combined in a working fashion without being explicitly designed to do so.

In summary: any compiler “should” in principle be able to generate code targeting any C library on a given platform, but there are inevitably some warts that inhibit this mix-and-match in certain cases. Moreover, knowing how to link to a given C library requires knowledge of nasty implementation details of that C library. These details could perhaps be “promoted” to a vaguely standardised ABI detail of (in our case) the combined GNU/Linux/glibc platform, but this hasn't been done so far.

Another question that Simon asked was whether we could apply some hindsight to come up with a simpler model which avoids the hair-raising complexities that I'd talked about. One suggestion I had was a single virtual address space, which would eliminate the need for position-independent code in shared libraries (since a library's load address could be assigned at deployment time). Later, Raphaël also reminded me that Plan 9 does away with shared libraries altogether, apparently without it costing too much in memory. I'm sceptical that this wouldn't be costly on a desktop system though (think how many copies of the KDE or GNOME libraries you'd end up mapping, for example). I'm also a big fan of Multics-style orthogonal persistence, which ties in quite nicely with the SVAS approach, but is a far-reaching change. Meanwhile, I think the trade-offs surrounding large-versus-small processes and the complexity of near-versus-far addressing modes are quite difficult to avoid (without an obvious performance hit), since they come to us from the hardware. Perhaps we could use reconfigurable hardware somehow to push all that complexity out of the view of compiler writers... but I doubt any hardware designers consider this a problem worth tackling.

I'm giving a “Part 2” follow-up talk on (most likely) Monday 3rd February.

[/research] permanent link contact

Tue, 26 Nov 2013

(Tell me why) I don't like Java

I can just about cast my mind back to when I learnt Java. It was 2001, when I was a 17-year-old C++ programmer. (Actually the particular version of Java I first learnt was called C#, but let's not complicate things.) I remember being fairly impressed. It struck me as a C-family language that had kept enough of C++'s features to do abstraction neatly enough, modulo the occasional grumble, while making some impressive complexity savings by carefully-chosen omissions. (I knew nothing about Smalltalk at the time.)

At the time, many of these omissions were bold, and the way they worked together was, to my 17-year-old self, ingenious. Lack of subobject structure keeps garbage collection simple. In turn, garbage collection (with compaction) allows faster heap allocation, clawing back some of the cost of all these extra objects. Even doing away with templates, by relying on checked downcasts and a common supertype, had its elegance, despite now being supplanted by generics. Grumble-wise, I won't say I wasn't alarmed by the lack of multiple inheritance and operator overloading. But overall I wasn't unsympathetic to what Java was doing.

Nowadays, the mention of Java makes me groan (in my mind). It seems to be a combination of the intrinsic and extrinsic properties that makes it so. Thinking extrinsically, Java has a huge, huge mindshare. Like market share, mindshare is dangerous once one player gets hold of too much of it. Many intrinsic details of Java would be unremarkable warts in another context, but have a giant significance, because they're what a large proportion of programmers think are “the way programming is”. In reality, Java contains a selection of somewhat-arbitrary design choices made by clever but flawed human beings some time within the last couple of decades. By letting them become ingrained, we are closing people's minds and inhibiting adoption of better alternatives. What follows is a quick run-down of how I see the problems.

As a language, it's boring. This is actually one of the least bothersome things about Java, but it needs mentioning here. As a researcher, doing any work that is specific to the Java language feels like work not done properly, because it avoids a lot of potentially more interesting cases. (I'm being a little facetious here, but I hope the underlying point is visible.) Against today's competition, Java seems remarkably inexpressive as a language. C++ is cleaner than ever, Scala is achieving a C++-like complexity curve with nicer syntax and marginally less unclear semantics, and dynamic languages are faster than ever. What keeps Java going is its self-perpetuating ubiquity. It also helps that Java is in a sweet spot regarding IDE assistance: Java code has enough statically-derivable properties to do automations like refactoring and autocompletion reasonably precisely, while being simple enough—unlike C++ and, I venture, Scala—to implement them reasonably correctly without giant effort. Is this a good thing or not? It has no doubt delivered certain innovations to practitioners faster than would otherwise be possible. But right now, to me, its net effect seems to be to stifle innovation in better languages.

As a learning device, it's a dog's breakfast. This is probably my number-one beef. Java is too complex to cleanly expose concepts like procedures, structured data, or pure object-oriented programming—never mind functional styles of programming. Yet it is too semantically simplified (constrained) to reveal essential contrasts and orthogonalities, such as between by-value versus by-reference, heap versus stack, dynamic versus static type checking, and inheritance versus subtyping. Various conflations of these are the deliberate features of Java's design that make it simple (relative to C++, say). But now that Java is ubiquitous, they start to be harmful too. I concede that just because a language isn't the ideal vehicle for exploring conceptual distinctions doesn't make it worthless as a programming tool—far from it. But the effect of mindshare is huge. The concepts of programming are harder to teach in a world where we must teach Java, because it is not a vehicle for cleanly conveying any of these concepts. My impression is that today's programmers learn fewer languages than ever, so it is harder to establish contrasts. Even diverse languages are being shoehorned onto the JVM, further enshrining Java's [bytecode's] limited vocabulary as a set of fundamentals. It's like some perverse “worse is better” situation, except that the usual point in favour of the “worse” solution, namely simplicity, it not in evidence much. As I'll rant in a moment, Java is a very complex beast below the language level.

Portability is a myth. Java's portability was its key selling point, but it's not clear that it has succeeded. Although Java makes it hard to write code with nonportable semantics, that is only a small part of the portability puzzle. Libraries are a big issue. JVMs are complex enough to suffer compatibility problems among each other, too. So now we just have portabilty problems one level up. I've been sceptical about the drive for portability because in general, no two portability requirements are quite the same. Having “the same” behaviour among every program built using technology X”, which is my paraphrase of Java's portability sell, is primarily a benefit to the authors of technology X, not to the software's users, nor even to the application developers. For example, as Ian Lance Taylor pithily blogged (talking about Tk), saying that applications look the same on any platform means they look odd on every platform. Attempting to insulate the JVM and its libraries from all details of the host system has become a dogma pursued far beyond its usefulness. This theme continues....

As a deployment infrastructure, it's a mess. Again in the name of “portability”, the Java platform tries to hide all trace of the underlying system, defining its own conventions for configuration (classpath, system properties), resource limits (Java heap size), its own archive formats, its own security model, and so on. The problem is that the underlying system always has some way of doing these, and by duplicating this functionality, the end result is excess complexity and hard-to-understand interactions. Even despite massive uptake, the JVM isn't the only runtime. The approach of eliminating complexity by defining “one platform to rule them all” is a hopeless modernist ideal. In practice, it just adds one more platform to the mix, causing a strict increase in complexity. This is the key paradox of attempting to achieving simplicity through portability. It can only succeed if the portable thing completely hides the underlying complexity. This is less and less likely the higher up the stack you try it. Instead we should aim to build systems out of simple (but not necessarily portability-providing) pieces, then achieve systemic assurances by reasoning about whole compositions. Typically, the kind of “whole composition” I'm talking about would be a runtime plus an operating system plus an instruction set architecture. (Interestingly, I contend that stopping at the ISA is sensible; we don't have to go all the way down to microarchitecture. Unlike JVMs, ISAs are low-down enough that they can hide the underlying complexity fairly well.)

As a development infrastructure, it is diabolical. Debugging and dynamic analysis on the Java platform are hugely, horribly flawed. I wrote a paper about one aspect of this a while back. Primary evidence is how the leading implementation (namely Hotspot) implements various profiling and debugging tools (like hprof) using private interfaces (namely the “serviceability agent”), because the pseudo-standard interfaces aren't good enough. And that's in spite of their unfathomable complexity. These interfaces—JVMTI, JDI and friends—are not officially part of the JVM specification, and are pseudo-standards in that no two implementations are quite alike. They also offer inherently poor coverage, because two large chunks of the process's code—namely natives and the VM itself—are respectively not covered, or covered in limited fashion using a disjoint mechanism (which may or may not be implemented). As a result, if you want to do something as simple as observing all allocations of Java objects in your program, you have to do three different (and fiddly) things: bytecode instrumentation for a selection of bytecodes (creating objects and creating arrays), handle JVMTI's VMObjectAlloc callback, write JNI function inteceptors to catch creation from native code. Even then, you're a long way from understanding the memory behaviour of your program, since, surprise surprise, native allocations—including (on most VMs) those made by the VM itself—are actually very significant. There was even a paper at OOPSLA 2010 about this. There are analogous problems in other kinds of dynamic analysis in Java. Try implementing an information flow analysis at the JVM level, and you will be stymied, because information is continually flowing through native code and through the VM itself, and these are wide, complex, undocumented interfaces. (By contrast, doing this at the whole-process level requires only modelling the system call interface, which is manageably-sized and stable.) Java-specific interfaces simply cannot cover, by definition, the whole behaviour of your program.

Debugging technology has gone backwards This is a bit of a specialist rant, so forgive me. It builds on what I just wrote about dynamic analysis. The conventional approach to debugging is takes a simple and high-coverage abstraction, namely the memory image of the debugged process, as the baseline. On top of this, we selectively describe how source-level language features are realised. We does this by documenting the compiler's (and runtime's) implementation decisions, using standard (albeit quirky and complex) formats like DWARF. Apart from GNU Java, no Java implementation I know does this. Instead, they rely on Java-specific interfaces. In so doing, they take a big technological step backwards, for no better reason than expedience. I can excuse VM prototype developers for taking the short cuts of knocking up an in-process debug server with a fixed wire protocol, and building shim “tool interfaces” as thin abstractions of their internal APIs. But in a platform that is the outcome of billions-of-dollars product development, there is no such excuse. As with portability, Java adopts an approach which can only work if it owns the entire system. Since it does not, it fails to eliminate any complexity, and instead just adds more.

As a culture, it's conspicuously herd-like. I suppose this is a universal property of communities. Languages as practical tools tend to take on the properties of the individuals using them. That's why Haskell is over-clever, Python is hack-filled, C is rarely well commented and Java is bureaucratic and verbose. In each case you could say the same for these languages' advocates. My distate for Java owes partly to the fact that it is favoured by not-so-good programmers and middle managers. The effect permeates the community. Java people love reinventing things which existed before—only this time, it'd “pure Java!”. They also seems to love XML, a technology I despise. They outdid themselves by combining these two properties to spectacularly ill effect, as known to anyone who's ever edited a file called build.xml. Reading about Eclipse plugins makes me jaded. I don't know whether it's the chicken (programmers) or egg (Java language) that's to blame for all these problems. Either way, practice has forgotten far too many principles: of keeping it simple, of not duplicating mechanisms, of using information hiding to hide what is change-prone rather than just to make your code more complicated. The principles of object-orientation itself are long forgotten. To illustrate this latter point, I defer to the very nice (developer-focused) talk by Kevlin Henney titled after William Cook's careful statement that “it is possible to do object-oriented programming in Java”.

Here's an anecdote to finish. I just read a research paper containing the following two sentences.

The first kind of behaviors is interface behaviors. (Please note that the name has nothing to do with Java Interfaces.)

Sadly, I can understand why the qualification is necessary. But it really shouldn't be. If we're doing research, what we're doing should transcend the immediate technology, and we should be accustomed to a presentation which reflects that. But even in research communities, too many people equate doing X “for software” with doing X “for Java”, and this is a sad state of affairs.

[/research] permanent link contact

Sat, 23 Feb 2013

A curiously recurring explanation

In conversation recently(-ish), I tried to explain the infamous Curiously Recurring Template Pattern (CRTP) of C++ by relating it to mixins. That just turned one surprisingly tricky problem into two. Here I will try to rectify both problems by proivding a somewhat coherent explanation of how to realise mixins using CRTP.

/* defining a mixin */
template <class Instantiator>
class my_mixin
	int some_state;
	void invoke_feature() { /* using some_state, somehow */ }

/* "mixing in" a mixin */
class built_from_mixins
 : public 
      my_mixin<built_from_mixins> /* , other mixins ... */
    /* ... */

Unlike normal base classes, a mixin is designed to attach at an arbitrary point in the subtyping hierarchy. It's an orthogonal feature that can be “mixed in” to many potential classes, rather than an increment on a particular class. Therefore, rather than referencing a specific base class, a mixin leaves this reference unbound. Nevertheless, it still gives a name to the class it is extending. Here it is called Instantiator. This can be thought of as a “forward reference” to the class that is “mixing it in”. The name is a placeholder for the class that the programmer will extend using the mixin.

(There are other variations on this theme in other mixin-like constructs. Anyone interested in the many meanings of “mixin” could do worse than to start with Richard Gabriel's essay which is based around this subject— though I note that its actual point about incommensurability is deeper, completely distinct, and very interesting!)

Looking at the code, we see there is a cyclical reference chain: from the mixin user built_from_mixins to a specialisation of the mixin itself my_mixin<built_from_mixins> and back (via Instantiator, which is instantiated to built_from_mixins). These references are a bit strange. We haven't even defined built_from_mixins at the point where we use it to parameterise our mixin instance. Why does the compiler even allow this?

The answer is that of course it's allowed, and the compiler allows it simply by virtue of the usual C++ (and C) rules about incomplete types. Cyclic reference among data type definitions is not unique to mixins. Consider a linked list, where it's no problem to create a recursive “next” pointer inside the list node type. Since pointers-to-incompletes are allowed, and the list node type is just another incomplete type at that location in the code, the code compiles fine.

It takes only a small generalisation to apply this not just to incomplete pointer target types, but more generally to incomplete types used as template parameter instances. Of course we can refer to built_from_mixins inside its own inheritance clause—but we can only do things that we can do with an incomplete type. Using it as a template parameter is one such thing—so long as the template's definition is consistent with its parameter being incomplete. In particular, possible usages of Instantiator inside my_mixin, above, are limited if we want to use the incomplete my_mixin as our Instantiator: we can only do the things we can do with any other incomplete types inside a class definition. Happily, my_mixin's definition sticks to this regime, so is well-formed. Moreover, it itself is a complete data type! (Similarly, if you defined a mutually recursive pair of data types using pointers, whichever one of them you defined first in your source file would be complete immediately, even though it contains a pointer to the yet-to-be-defined second data type.) Being complete, our instantiation of my_mixin is fair game for deriving from. This is what allows us to derive build_from_mixins from my_mixin<built_from_mixins>: the latter is complete, even though its type parameter built_from_mixins (known as Instantiator inside the mixin) isn't.

In fact, we haven't used Instantiator at all inside my_mixin. So, why include it? What can we do with it? Well, we can safely use it in any way we can use any other incomplete type: as a pointer (or reference) target type, or as a type parameter. An example is boost's enable_shared_from_this, a mixin which adds the necessary state and member functions for allowing a class to provide a shared_ptr version of its this pointer. You can't safely create a shared_ptr from a regular pointer in general because you don't know where the target object's reference count lives. The enable_shared_from_this mixin fixes this by embedding a pointer to the refcount, in the guise of a weak_ptr subobject, into the mixing-in class. The guts of enable_shared_from_this are basically as follows.

template<class T> class enable_shared_from_this
    mutable weak_ptr<T> weak_this_;
    shared_ptr<T> shared_from_this()
        shared_ptr<T> p( weak_this_ );
        return p;

Just as in our first example, we have some private state and a public interface which implement an orthogonal feature that can be “mixed in” to any class. The mixin-instantiating class T is referenced only in an incompleteness-tolerant way, to instantiate other templates and (eventually, inside the definition of weak_ptr, which is not shown) to define a pointer target type.

I've also seen CRTP described as “virtual functions without polymorphic behaviour”. What does that mean? Since our mixin has a reference to its instantiating class, it can call its methods—even though the binding to that specific class has not yet been formed. In other words, we have deferred the binding of methods—but not until run time. Rather, we have deferred them to later in compile time, when our templates are elaborated. Let's look at an example.

Let's try to get this deferred binding without using CRTP, but also without using virtual functions. Unsurprisingly, it doesn't work. The best we can do is to try non-polymorphic inheritance, by writing something like this.

class X
	void f() { /* unimplemented */ }
	void g() { f(); }

class Y : public X
	void f() { cout << "Done something"; }

Of course, if we call X::g(), it calls X::f() and not G::f(). Using CRTP, we can get it to call G::f() without resorting to virtual functions.

template <class Derived>
class X
	void f() { /* unimplemented */ }
	void g() { Derived::f(); }
class Y : public X<Y>
	void f() { cout << "Done something"; }

CRTP allows the base class to leave certain functions undefined, for definition later in many possible derived classes. The derived classes are not derived the usual way, though: they are derived using CRTP, passing the derived class back to the base class as a type parameter.

This sets up a now-familiar kind of cyclic reference: the base class refers to its (potentially many) derived classes, through its template parameter. Having this “forward” reference, down the inheritance hierarchy, as well as the usual backward reference up it, is what allows derivation-time binding. It's also limiting: we can't subsequently “re-override” Y::f(). Y's method bindings are fixed at derivation time. We have to create a new specialization of X and derive immediately from that, using some other means to get at Y's functionality if we need it.

Interestingly, note that it's okay for us to do Derived::f() in our X class template. This is surprising because at the point where we instantiate X, Derived is an incomplete type. I mentioned earlier that we were restricted in what we could do with incomplete template parameters, but in this case, here we are happily calling a method of ours. At definition time, there are at least two possibilities for the code that must be generated at the site of our call to Derived::f(), because f() might be an instance member function or a static. (It could also be a function object overloading operator().) When we instantiate X, if we read the code strictly top-to-bottom, we haven't yet got to the body of Y, so it is not yet decided whether it will define f() as a static or an instance member function. Somehow, the compiler is examining looking ahead at the definition of Y at the point where it elaborates X<Y>, even though Y cannot be complete yet (because we're in the middle of elaborating its base class). I must confess, this is where my understanding of the C++ language runs out. Thte standard contains complicated rules about name lookup in templates—principally the infamous “two-phase name lookup”. In short, “unqualified names” in templates are looked up at template definition time, whereas “qualified names” are looked up at instantiation time. Clearly our use of Derived::f() is a qualified name. No doubt there is some part of the standard detailing how the second-phase lookup is permitted (and required) to look into an incomplete type, namely Y in our example (incomplete at the time of X's instantiation), to understand the meaning of Y::f(). I haven't yet found a reference to it though. If anyone can point me to it, I'd be much obliged.

[/devel] permanent link contact

Thu, 10 Jan 2013

Systems versus languages

Somewhere buried within one recent magnum opus in these pages I highlighted a contrast between systems and languages. I also noted that OOPSLA conspicuously contained the word “systems” in its title. This juxtaposition probably seems incongruous to some, but it is one close to my heart. I see myself as a researcher tackling problems usually labelled as “programming languages”, but with a “systems” mindset.

Richard Gabriel's very interesting piece about incommensurability in scientific (and engineering) research, using mixins as an example, makes some remarkably similar observations. (There's also a video of a presentation from last year's ClojureWest, but I haven't yet watched it.) He distinguishes the “engineering” papers that first discussed mixins from the “scientific” papers that subsequently studied and formalised them. And, without disparaging the latter, he very much emphasises the underrated value of the former. (He also makes some very incisive observations about apparent misunderstandings and omissions in the influential Bracha & Cook paper. A major point, of course, is that they need not be misunderstandings per se—incommensurability is the posited explanation.)

It's nice to get a historical perspective on these matters, from someone like Richard Gabriel who has seen a lot of ideas come and go. In part, his appraisal of the changing role of engineering papers offers me some reassurance that I might not be crazy. In conversation with other researchers, it can be frustrating that declaring my interest in programming languages is taken so often to imply that I do theoretical work. But I am more interested in languages as systems—my interest has to do with their “useful-for” properties, not “abstractly-is” properties. (Of course, I'm glad people are working on the latter, and I try to relate their findings to what I do... while accepting that it's not what I do.) Another interesting historical tidbit is that Gabriel estimates that my kind of work—engineering approaches to programming problems, you could say—was “outlawed” from (yikes!) roughly 1990 to 2005. I suppose it's handy that I started my PhD in 2006. The feeling of “separate camps” is still very much there, of course.

[/research] permanent link contact

Thu, 06 Dec 2012

Bridge that gap

I don't do IDEs.

(This post isn't really about IDEs, but humour me.)

I used to do IDEs, back in my Turbo Pascal days. And I want to now: I value the things they promise. But the reality, for me, always seems infuriating and limited. They get in my way more than they help me.

One example: Eclipse. I used Eclipse for Java programming back when I was an undergraduate, and mostly bent it to my will. Since then I've done fairly little Java programming, and a combination of mind-rot (mine) and evolution (Eclipse's) has left me sufficiently unfamiliar with recent versions that I am unable to use them effectively.

I just tried again. I wanted to open a third-party Java codebase's Eclipse project that I had downloaded from that codebase's web page. I am in the Java perspective, and I click the “Project” menu, hoping it will let me “Open project”. But no! “Open project” is greyed out. Woe is me.

Greying out UI elements is a terrible, terrible practice that should never have been invented, because it doesn't tell you why something was greyed out, so leaves you clueless about how to ungrey it. But no matter. Being a researcher, two ideas occur. Both are ways in which we could, for “little cost”, add back the ability for a program to explain why a widget is greyed.

Idea one: we should really be using some modern programming technology under which the “greyedness” state is described declaratively. In that case, the program's code will contain a concise description of exactly what causes the menu item to be greyed. I could check out the code to understand why it was greyed. Or perhaps this condition could be reflectively exported to the user. Either way, if only we had written our UI in some declarative style! Then our problem would be easily solved. But, alas, we haven't written it like that. Instead, greyedness is the emergence of some maze of twisty little imperative procedures.

Idea two, luckily, is more immediately applicable. Let's “shadow” each UI element with some extra information to do with its greyedness or otherwise. When we set its state to greyed, we snapshot some property of the program state, like the address of the caller who is turning on greying, or the whole callchain, or whatever. Then I can somehow query this shadow—maybe by attaching a debugger to Eclipse, or whatever—and have it tell me why it was greyed.

A thought occurs. Is this a general-purpose approach to “programs that explain themselves”? (Hat-tip: “Functional programs that explain their work”, by Perera, Acar, Cheney and Levy, seeded that phrase in my mind, although the general idea has been with me for much longer.) Interactively querying for explanations of program state or program output seems like a general, powerful tool, both for programmers programming and users UI-using.

Aha! you might say: this could work for a Dr Expert Developer , but there's a problem for poor Joe User. The “explanations” will be in terms of program objects, not “something the user understands”. Therefore, you would argue, my approach is not very useful except to niche-case users like me. But to this, my rejoinder is: why shouldn't they understand? If an object-oriented program is properly abstracted, it will have a “surface” level of objects which model the domain as the user sees it. Dialogs and buttons and widgets and text entry areas are all just objects, and users understand these more-or-less fine (accepting caveats from the HCI audience).

It seems to me that paving this continuum, between user-facing and program-level abstractions, is one of the great promises of object-oriented programming. I wouldn't say it's fulfilled, but then, I wouldn't say we program in a terribly object-oriented way most of the time. When I was quite a bit younger, I found it odd that Alan Kay would simultaneously have been working on user interfaces and on object-oriented programming. More recently I think I have begun to grok this connection. My latest Eclipse problem is yet more evidence of why it's a useful one.

This connection is also one which functional programmers of the Lisp-y, Scheme-y schools understand. The abstractive power of these languages is supposed to be used—I've heard it said, at least—to construct domain-specific abstractions which naturally model the elements of the program's work (its “objects”, you could say, except that here we won't). In this way, the program itself is lifted up to a level of abstraction which the user, by definition, would understand. (Emacs users might be able to tell me how well this can work out... anyone? Oh, all right.) I lean more towards the object abstraction than the lambda, but perhaps it's six versus half-a-dozen.

Perhaps disappointingly, that's mostly the end of this post. But I can't resist appending a rant about my own work. What I'm convinced of is that the object abstraction is everywhere... it's just latent a lot of the time. It's in C programs, functional programs, windowing systems, filesystems, OS kernels, spreadsheets, web browsers, IDEs. It's latent in any chunk of state at all. Yet the promise of seamlessly bridging that gap—of constructing a continuum between the programmatic and the user-facing—is not yet with us. That's because there are hundreds of different ways in which this state has been constructed and encoded, and no infrastructure unifies them. Classically, we can only shrug and say that people weren't enlightened enough to write their code the “right way”, using the right languages, targeting the right runtime infrastructure, and so on. But more likely, those “right” things were never fully realised, and never will be. Either way, what we need is a postmodern object-oriented runtime: one that can find the object abstraction, and present it to us where it exists, even if that presentation is something of an illusion—an adaptation, you could say—of reality. (This concept is also similar to views in a relational database.)

What would such a runtime look like? What's a precise definition of the problem it's solving, even? Both of these are hard questions to which I have no complete answer. But it's no coincidence my PhD was about adaptation (although I'm not saying you should read it). And my more recent side project on DwarfPython (that I would love to pursue, if only I could find a way of getting paid for doing so) is also tackling a subset of the same problem space. Instead of a language implementation owning its object representation, can we build one which gives up that ownership, but rather takes on the job of “seeing” arbitrary chunks of state as objects it can manipulate? The idea of DwarfPython is to use object producers' debugging information to do just that. More generally, in-memory state is not the only kind of object; I also have an interest in unifying files with objects. Again, rather than the classical approach of persistent programming languages, we can take a more postmodern approach in which each file has the freedom to represent its contents, or “object state”, in a different way, or a variety of ways, subject to appropriate interpretation. This is thematically similar to the work of my Bachelor's dissertation; although that pursued a far too classical approach, it was still trying to unify files with objects. So, finding the object abstraction in unusual places seems to have been a theme of my work from day −1, even if I didn't realise it at the time....

[/research] permanent link contact

Mon, 03 Dec 2012

Tools or not tools

Jim Coplien's keynote at SPLASH this year was a peculiar one. It featured two particularly provocative sentiments: firstly that too much abstraction is a bad thing, and secondly that building tools is not what we want to be doing. (The latter was actually due to Dave Ungar, during questions, but met with vociferous agreement from the speaker.)

These remarks met with some puzzlement from much of the audience, judging by a series of apparently disparaging remarks during subsequent conference talks and question sessions. Who could disapprove of abstraction or tools? I think there are some reasonable answers to that question; what follows is my attempt. (I have no idea whether it coincides with Coplien's or Ungar's.)

Abstraction all the way up

The abstraction issue is perhaps not terribly controversial. Most programmers are aware that abstractions present a trade-off. The temptation to abstract endlessly can be a rat-hole that distracts from actual progress on the task at hand. Ian Lance Taylor once blogged a fairly similar opinion. If you abstract too far, you abstract away essential features of the domain, rendering it unrecognisable. This yields “abstractions” that are actually complex, not simple, to use. Abstracting over multiple use cases, i.e. generality, is a common offender here. For example, rolling your own implementation of a graph algorithm can be easier than figuring out how to tame the monstrous generality of something like the Boost graph library. (Pardon my C++; no doubt you can think of your own examples.)

Sometimes, abstractions exploit specificity, by packaging up common case usage patterns. This can be very useful. In fact, in interesting counterpoint to the Taylor piece above was Rustan Leino's note about loop structures in a subsequent SPLASH keynote: inferring loop invariants is one of the hard problems faced by any verifier. By constraining the form of a loop, it becomes easier to find its invariant. Abstract loops are an extreme case of this, since the loop itself is in library code and not in user code, so the invariant need be found only once. But of course, just as Taylor hinted at, any user forcing themselves only to use such loops will end up spending rather a lot of time structuring their code to accommodate this constraint. (In this way, it shares a lot with other syntax-directed reasoning tools, including type checkers. These tools are superficially easy to market—hey, look, it shows you bugs in your code. But there is a hidden cost to using them, deriving from implicit constraints on how you can structure your code such that it interacts nicely with the tool. If you don't stick to these, your tool fails in some way, like false-positive type errors or solver timeouts.)

To end my rants about abstraction on a complaint, I could also roll out one of my previously-blogged complaints about common styles of functional programming—with liberal use of polymorphism, or unification of higher-order with “ordinary” operations (think currying, or juxtaposition-is-application), code can become needlessly hard to read. Lazy languages add the unification of storage with computation, which I concede is sometimes an incredibly useful abstraction, but easily makes the memory behaviour of your program incredibly difficult to understand.

What about tools?

For me, the most interesting issue concerns tools. Dave Ungar phrased it something like as follows: “if every bolt under my car had a little handle on it, I wouldn't need to get out to go and get a wrench”. So, let me frame the contrast I believe he was making as one of tools versus run-time systems. Dynamic O-O environments are very much systems, geared around the ability to push new capabilities down into the system's fabric, rather than having them sit on top. This “fabric” is what emerges from combining the messaging metaphor (messages are fundamentally proxyable) with dynamic extensibility (adding new messaging behaviour is a local change during runtime, not a far-reaching change at compile time). As I have rambled about previously, the lower some functionality is integrated into a system, the more pervasively available it is, so the more power and leverage it confers. Smalltalkers and other dynamic language advocates know this. It's a very tricky thing to convey to the unfamiliar. It's even harder to measure. Most of us don't use runtimes that have this amount of dynamism and immediacy, although Javascript may yet change that. Operating systems, not least Unix, are also dynamic runtimes in this way, although their inability to see inside application means (unfortunately, and avoidably) that a large amount of useful code and data (hence potential “extension”) is opaque to them.

Tools are fragmentary; runtimes are integrating One reason people develop tools and not runtime extensions is that integration is hard. If you write a command-line tool, you get to define its input domain, output domain and behaviour from a clean slate, according to your convenience. This is often (though not always) easier than plumbing something into a runtime, which is a bunch of code somebody else wrote. But let's imagine making the leap. To get slightly more concrete, suppose the “tool” we're interested in is a dynamic analysis tool—pick your favourite bug-finding, race detection, memory leak detection or other tool that fits the general mould. What's better about having it as a “runtime” rather than just a “tool”? Well, its functionality would be embedded right there in your running program. As a consequence, it supports exploratory, interactive, programmatic use. If you dropped to a REPL in your program, the innards of the tool would be laid out across your program state, pushed into fields on program objects. If your tool is a race detector using a lock-set algorithm, for example, then each object's lock-set would be accessible as a field on that object. If you're using timestamps or vector clocks, they would be there too. You're also not stuck with a fixed amount of insight the tool's authors saw fit to provide (e.g. when trying to track down the source of a data race); the tool's code is a service you're free to extend. Getting interactive, exploratory, programmatic usage seems like a useful payoff for the effort of integrating your tool into a runtime. Arguably, then, the challenge is building runtime infrastructures that are not unduly difficult to extend like this.

Progress? Is the “tools not runtimes” tendency getting stronger? “Systems, languages, applications” is the conference title's invariant. “Tools” is nowhere to be found. My vague impression is that today's programming research is more tool-focused, and less system-focused, than 20--30 years ago. (A near-dual property is also true: self-proclaimed “systems” research has less programming focus than it used to. I used to bemoan this frequently while doing my PhD in the systems research group in Cambridge.) But why? Simplistically, we might just say that integration is hard. I think there is something more subtle at work. Integration of research techniques into runtimes arguably scales poorly—since we all have to integrate into the same runtime, we have to achieve consensus on that runtime's interfaces. Tools, being freestanding and piecemeal, arguably scale better. You could say that lots of small, freestanding tools are the postmodern way, whereas “one true runtime system” is a classical ideal. (It's fitting that Noble and Biddle's “Notes on Postmodern Programming” was recognised at SPLASH this year for its influence, in the Onward! strand.)

Avoiding classical fallacy In the battle of the classical versus the postmodern, normally I side with the postmodern. How can we square this with the desire for the benefits of the runtime approach as I described it? I should probably save my thoughts for another post. But two ideas come to mind. The first is one I've already mentioned: design a runtime infrastructure that is specifically easy to extend. But that seems to be begging the question: if we knew how to build this magical runtime, and what kinds of extension it would need to support, we'd already have done it and solved the problem ages ago. For this reason, we also need the second idea: we need to get into the mindset of tolerating a lot more heterogeneity. Very briefly, it means pushing radically downwards our notion of “runtime” so that most of the typical implementation decisions of an object-oriented runtime, such as dispatch mechanisms, introspection mechanisms and object layout, are actually user-level decisions in such a system, but still recognisable as the abstractions they represent. In other words, our system can descriptively extract extract latent object abstractions from the contexts in which they emerge in existing systems, given descriptions of these latent abstractions. This contrasts with traditional runtimes, in which the object abstraction constructed by the runtime implementor in a way that is prescriptive. And hey presto, we are back to my VMIL 2011 workshop paper: we already have a very powerful descriptive technology, in the form of debugging infrastructure for native code; our task is to bend it to this new purpose. So, end of rant for today.

[/research] permanent link contact

Wed, 27 Jun 2012

32 bits should be enough for anyone

For a brief while, 32-bit Intel machines were the de facto standard in commodity hardware , and life was simple. Certainly, it's an ugly architecture, gigantically overcomplicated by backwards-compatibility. Its virtual addressing features are terrifying. But the subset of it which user-level programmers on modern OSes use is fairly comprehensible. There is an ISA-defined stack with its own registers and a well-defined calling convention. Pointers and most integers are both 32 bits, meaning that the “word” is a useful and well-understood unit of storage.

All this changed in the early 2000s as AMD's 64-bit extension of the ISA came into popularity. Upon us were forced bigger integers, bigger pointers, and a more complicated stack and calling convention (in the name of “performance”, but at huge cost in complexity). I believe these were completely retrograde steps. Now that pointers are 64 bits, our software's memory footprint and disk footprint are bloated considerably. To “alleviate” this, and to avoid certain paradoxes about the size relationships between short, int and long, an int in most C compilers stayed at 32 bits. Unfortunately, this is completely braindead, because int is very much supposed to be a word-sized type. This is the reason that C's “defaults to int” semantics, as applying to unprototyped functions and untyped variables, are sane.

Does this matter? Yes! Here is some code that was mysteriously segfaulting for me this morning. It's from DTrace, or more specifically, Paul Fox's Linux port of it.

if ((P->rap = rd_new(P, P->pid)) != NULL)
  (void) rd_loadobj_iter(P->rap, map_iter, P);

Somehow, the pointer returned by the rd_new---which just wraps a simple calloc() call---gets corrupted immediately after returning. Suspiciously, said corruption is that the top four bytes are 0xffffffff, whereas the lower four bytes are those of the pointer returned by calloc(). Inspecting the disassembly around the rd_new call, we see something suspicious.

   0x000000000047ed12 <+150>:   callq  0x462bc6 <rd_new>
=> 0x000000000047ed17 <+155>:   cltq   

What's this cltq thing? Well, it takes the lower 32 bits of %rax (the 64-bit register holding the return value from rd_new()) and sign-extends them to fill the full 64 bits. This is exactly the corruption I was seeing. Why did the compiler insert this unwanted instruction? The answer is revealed if we recompile the file with -Wall.

Psymtab.c:645:3: warning: implicit declaration of function `rd_new' [-Wimplicit-function-declaration]

The function is not prototyped, so its return value defaults to int. But because int is now 32 bits wide, and the register holding the return value is 64 bits wide, the compiler helpfully obliterates the top 32 bits of the return value for us by sign-extending the lower 32 bits into them. If the compiler implementors had stuck with the intention of the int data type, that it be exactly a word in size, and therefore that “defaults to int” is sensible, this would not have arisen.

Now, this is clearly sloppy code. We should just fix it so that rd_new() is prototyped. It probably seems a bit of a nonsequitur that I am blaming this problem on 64-bit architectures. But on the other hand, how often in your code have you actually wanted integers that can store values in excess of 232? If you are a systems programmer, you might value the ability to encode large offsets. But we already had long long for this. In other cases, the vast bulk of our code deals with small integers, characters and pointers. Giving us an extra 32 bits of width in our ALU operations is a waste of transistors.

Why did we waste them this way? Well, we had to waste them somehow. In the early 2000s, we didn't really know what else to do with them, because (I suspect) there was little perceived demand for multiple cores in the commodity market (outside of servers). Nowadays, we have even more transistors, and even hardware guys realise that giving us 128-bit architectures would be a pointless waste. So, they spent some effort convincing us that we really did want multiple cores after all. And now we are busy complicating our software so that we can “exploit” these too. I have ranted before about how that risks generating a generation's worth of bad software. Perhaps I should say “another generation's worth”.

By the way, I'll concede that 64-bit address spaces can be useful, if they are used to support persistence or sharing. No need for pointer swizzling! But AMD's 64-bit x86 extensions do not provide the separation of protection from mapping to realise the sharing use-case. In other words, switching protection domains still means invalidating the TLB entries of shared mappings. Meanwhile, I haven't seen anyone using the extra address space for accessing persistent storage in a radically new way, although I'd love to see some approaches in this space.

I don't completely doubt the value of multiple cores either. The right way to see parallelism is as an enabler for radically more computation-intensive applications---likely to be in domains such as scientific computation, machine learning, and simulation---than what we can currently support. As I have also ranted about before, I am deeply disturbed by the fervour for mass rewriting of everyday software, and the disruption to the infrastructure it runs on, that is resulting from mindless multicore mania, in the same way that the 64-bit shift has disrupted our infrastructure. It's all in the name of performance, but it costs us far more of human beings' time and energy than it saves.

[/devel] permanent link contact

Fri, 01 Jun 2012

Link order

Initialization order of static state is a thorny problem. It's particularly tricky to get right portably. But until recently I didn't realise how tricky it could be even when restricting oneself to GNU tools on Unix platforms. Consider the following three-part program, consisting of an executable prog and two shared libraries lib1 and lib2. The dependency order is left-to-right in that list: prog depends on lib1 which depends on lib2.

/* prog.c */
#include <stdio.h>

/* from lib1 */
void greeting(void);

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
	fprintf(stderr, "Initializing prog\n");

int main(void)
	return 0;

/* end prog.c */

/* lib1.c */
#include <stdio.h>

/* from lib2 */
void hello(void);

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
	fprintf(stderr, "Initializing lib1\n");

void greeting(void)

/* end lib1.c */

/* lib2.c */
#include <stdio.h>

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
	fprintf(stderr, "Initializing lib2\n");

void hello(void)
	printf("Hello, world!\n");

/* end lib2.c */

Here's a GNU Makefile to tie it all together.

CFLAGS := -g -fPIC
LDFLAGS := -L$$(pwd) -Wl,-R$$(pwd)
LDLIBS := -l1 -l2

default: lib1.so lib2.so prog

%.so: %.c
	$(CC) $(CFLAGS) -shared -o "$@" "$<"

	rm -f lib1.so lib2.so prog

Now when you do make (or gmake) it will build a program that initializes its libraries in right-to-left order: from the “most depended on” to the “least depended on”. We can verify this by running the program.

$ ./prog
Initializing lib2
Initializing lib1
Initializing prog
Hello, world!

Moreover, if you try flipping around the link order in the LDLIBS line, the link will fail with undefined reference to `hello', because the reference to hello (in lib1) is introduced after the reference to lib2, and the linker's defined behaviour is to avoid re-scanning for new undefined references---it's up to the invoker to order the libraries so that this works.

Let's try this on a BSD system. I have a NetBSD 5.0 VM hanging around, so I'll try that. It has recent GNU make, GCC and GNU binutils installed.

$ gmake
cc -g -fPIC -shared -o "lib1.so" "lib1.c"
cc -g -fPIC -shared -o "lib2.so" "lib2.c"
cc -g -fPIC  -L$(pwd) -Wl,-R$(pwd)  prog.c  -l1 -l2 -o prog
$ ./prog
Initializing lib1
Initializing lib2
Initializing prog
Hello, world!

Strangely, our initialization order is flipped. This doesn't matter for our program, but if lib1 consumed some static state in lib2, it would matter quite a bit. What happens if we flip the link order around to compensate? We edit the LDLIBS line and re-make.

$ nano Makefile
$ gmake clean && gmake
rm -f lib1.so lib2.so prog
cc -g -fPIC -shared -o "lib1.so" "lib1.c"
cc -g -fPIC -shared -o "lib2.so" "lib2.c"
cc -g -fPIC  -L$(pwd) -Wl,-R$(pwd)  prog.c  -l2 -l1 -o prog
$ ./prog
Initializing lib2
Initializing lib1
Initializing prog
Hello, world!

This has done what we want. But what's going on? This link order didn't even work on GNU/Linux. Not only does it work on BSD, but it's required if we want a sensible initialization order. Our initializers run in left-to-right order, so we need to put the “most depended on” libraries first, not last. This isn't a BSD quirk per se, because we're using the GNU linker in both cases. I suspect the linker scripts are nevertheless different in the two cases. However, I haven't had time to look into the details of why. I'd be interested to hear, if anyone knows. I guess this is the sort of pecularity that gives libtool a reason to exist.

[/devel] permanent link contact

Metacircularity (or: “what's Java got to do with it?”)

Before I begin, here's an executive summary. Metacircularity is not about self-interpretation at all. Rather, it's an engineering approach to re-using as much code as possible between different parts of a toolchain (including compiler, runtime and debugger). This is noble, but limiting ourselves to working in a single language is needlessly restrictive. If we get over our presumptions about“language barriers” (cf. Oracle's disappointing attempt at explaining metacircularity), we can apply the same re-use philosophy to supporting a diversity of languages, not just one.

I've recently found myself wanting to understand the concept and purpose of metacircularity in language implementations. This is because I've become interested in understanding the Maxine JVM, which is often described as having a metacircular design.

All this meant to me at the time was that it's written in Java---making it a self-interpreter, certainly. But is it meta-circular? What does that mean? Why might it be a good thing? As I will show, if we're even just a tiny bit pedantic (which I hope we are), then metacircularity is not really a special case of self-interpretation, but a completely separate concept.

I've found two very helpful papers describing metacircularity. The first is that of Chiba, Kiczales and Lamping talking about the “meta-helix”. The other is Ungar, Spitz and Ausch describing the Klein VM, a metacircular implementation of Self. The first paper really emphasises the idea of implementation by extension which is at the heart of metacircularity. They note that use of metacircularity [is] “to allow extensions to be implemented in terms of the original non-extended functionality”. As the paper goes on to discuss, there is a tricky bootstrapping problem inherent in this. If we don't keep careful track of the dependencies between all these extensions, subtle and not-so-subtle bugs, most obviously infinite recursions, can arise. The paper is about avoiding confusion of metalevels, and as they propose, the shape of a helix, not a circle, makes much more sense in describing what supposedly meta-circular systems are actually doing.

The second paper, by Ungar et al, is more of a practitioners' view: it shows what VM builders consider to be a metacircular design, and what they hope to achieve by it. After reading these two papers, reading some other things, and scratching my head a lot, it became apparent the primary goal of metacircularity is to solve two very practical engineering problems concerning re-use: re-use of code and re-use of debugging tools. They mention the code re-use issue directly, by saying that in traditional designs “an operation such as array access must be implemented once for the runtime, once for the simple compiler, and once for the optimizing compiler”. The question of tool support is also an explicit motivation in the same work: they lament that in the non-metacircular Self VM, “to inspect a Self object... [in] an application that has crashed the VM, one must invoke a print routine in the VM being debugged, [an approach of] dubious integrity”, that the VM “must be able to parse both Self and C++ stack frames”.

So, perhaps prematurely, I'd like to propose a new characterisation of metacircular VMs that I think captures their true nature. Metacircular VMs are interpreters (in the general sense) that are carefully constructed to have a dense re-use structure. They do this by expressing as much as possible---front-end, compiler, runtime---as extensions over a small common core. This core has some interesting properties: it is designed explicitly for extension, and includes the necessary foundations of debugger support. It is the former which allows code re-use, and the latter which enables the same tools to see all the way up the stack, from various levels of VM code up to application-level code.

Note that this is fundamentally different from a self-hosted compiler. Under self-hosting, the compiling compiler is not re-used at all in the compiled compiler. It is just used to implement a translation function, from end to end; how it does it is completely opaque. By contrast, in a metacircular VM, you can invoke your hosting runtime to perform part of your work---asking it do “do what you would do in case X” by calling the corresponding function (or sending the corresponding message, for Smalltalkers). The trick is to ensure that these helper requests are correct and well-defined, meaning they do not cause infinite regress (the obvious bug) and do not confuse meta-levels (the more subtle bugs mentioned by Chiba et al).

As a consequence of this fundamental difference, the key challenge of metacircularity is not just that of “implementing language X in language X” it's dealing with the bootstrapping problem. What is a suitable common core? How can we make it small? What extensions must it permit, or may it permit? How can we structure the extensions on top of one another, so that they can express what we want, re-using what we want to re-use, and efficiently?

So, we've established that “self-interpretation” is irrelevance. But it seems that most metacircular designs are, in fact, self interpreters, right? I actually consider this to be false. Even when staying within “one language”, the fundamentals of the bootstrapping process means that at a given level in the interpreter, certain language features may only be used in restricted ways. Sometimes these restricted subsets are given a name, like “RPython” in the PyPy project. In other cases, they are not named. But in all cases, there are restrictions on what functionality at some level in the system it is is safe and meaningful to invoke at the metalevel. Indeed, this is exactly the “helix” shape that Chiba et al were describing. In other words, different parts of the interpreter are written in different sub-languages, precisely in order to avoid infinite regress. Just because there is continuity between the core language and the eventual top-level language doesn't make them “the same”, and for this reason, metacircular VM designs are not self-interpreters.

If I were to write a ranty summary of the above paragraphs, it would be that the apparently “beautiful”, head-twisting, recursive, quasi-mathematical aspects of the metacircular design---the things which language nerds get excited about---are both irrelevant and illusory. Metacircularity is motivated by engineering pragmatics, not “deep” linguistical or mathematical concepts. (Homoiconicity, itself a concept of overrated interest, is an orthogonal concept to metacircularity, despite what at least one blogger has written.) I believe this fixation with superficial observations about language stems from the documented inability of many programmers to divorce concepts from language. (For “documented”, I can say at least that Lamport has commented on the problem, and in this case I agree with him. I have big disagreements with other parts of the same article though. I will post about those in the near future.)

So, having stated that re-use is the good thing about metacircularity, why is re-use so much easier in a metacircular design? The reason is that we have a common core providing a coherent and adequate set of services---the services embodied in the “bootstrap image”. And I say “services” and not “language” for a reason. The core really is a set of runtime services. As I have explained, is only a distant relation of whatever high-level language the author is intending to realise. In our current technology, re-using code and re-using tools across languages is hard, and so “build everything in the same language!” seems like a useful answer to a VM author's problems of API-level interoperation and of tool support. Metacircular designs are the result (because it's the closest you can get to doing everything in one language). But as I've just described, the “same language” property is an illusion, and there are inevitably many languages involved. It just happens that in current projects, those languages are designed to be as similar as possible to one another---featurewise increments, in effect. But instead of this unimaginative perspective, anyone building a metacircular VM should ask themselves: how can I design my core services---the core of the VM---to support as many different languages as possible?

This will sound familiar to anyone (so, hmm, maybe ten people on the planet) who has read my “Virtual Machines Should Be Invisible” paper. Although it doesn't approach the problem from a metacircularity perspective, this paper is all about building an infrastructure that can support a diverse variety of languages, sharing code and tools between all of them.

Currently, our shared base infrastructure is a POSIX-like operating system. Every VM author (even those interested in Windows, which I'm cool with) implicitly targets this infrastructure. Unfortunately, these systems don't provide enough abstractions. As such, different language implementors build their own infrastructure which reinvents similar abstractions in incompatible ways---including functions, objects, garbage collected storage, run-time self description, exceptions, closures, continuations, and so on. We can clearly avoid this pointless diversity without sacrificing innovation. Just as with operating system interfaces, there is never complete quiescence or consensus, but we still manage to share a lot more software between OSes than we did in the pre-Unix or pre-POSIX days.

One of the mitagating techniques which my VMIL paper describes but which metacircular designs don't use is: describe your implementation decisions. Don't encapsulate them! If you implement a certain language feature a certain way, describe it. There is nothing fragile about this, because your descriptions will be written in a standard way and consumed by an automated interpreter---called a debugger. This is what native debugging infrastructure does. VM-hosted debuggers, of the Java or Smalltalk flavours, don't do this. To make the value of this approach clear, let me finish with another example from the Ungar paper, where they proudly state that Klein VMs can be debugged remotely, and in a post-mortem fashion, using another Klein or Self VM. “A separate, possibly remote, Self VM hosts an environment that manifests the innards of the Klein VM at the source level. Thanks to Klein's metacircularity and Self's mirror-based reflection model, Klein can reuse a vast amount of already-written Self programming environment code.”

What the authors are forgetting here is that this is not a new facility. Native debuggers have long had the capacity to inspect remote processes. Smalltalk-, Self-, and Java-like designs took a retrograde step by forcing debugging to exploit the help of a server within the VM. Although this has the benefit of allowing the debugger implementation to share the introspection services already present inside the VM, it requires a core of the VM to remain working correctly, even after a failure, which precludes many cases of post-mortem debugging. By contrast, trusty (or crusty? your choice) old native debugging is necessarily designed for this as a common use-case.

The approach I advance instead, as described in the VMIL paper, is to implement introspection on the same infrastructure that supports remote inspection---which happens to be the DWARF infrastructure, in the case of DwarfPython and modern Unix-based compiler toolchains. This is very similar to the Klein approach, in which mirror objects may reflect both local and remote state. But it completely avoids the myth that we should implement everything in a single language. Indeed, debugging information formats like DWARF are actively concerned with supporting a wide variety of languages. One Klein process can inspect another because they share a set of implementation decisions. By contrast, a native debugger need share nothing with its debuggee, because native debugging infrastructure includes a facility which is fundamentally omitted from VM-hosted debuggers: the language implementation explicitly describes its own implementation decisions. It does this down to the machine level, and moreover, up from the machine level.

The result is that given a memory image of a crashed program, we can recover a source-level view of its state at the time of a crash. VM-hosted debuggers are fine for user code because encapsulation and memory-safety protect enough of the VM implementation that the debug server can still work. (Notice I don't say “type-safety”! Type-safety is just an enforcement mechanism for encapsulation, not the key property that ensures encapsulated state is not corrupted.) These VM-level guarantees do not have such a nice property if the failure was due to a bug in the VM itself. This is because the invariants of the VM's own data structures are by definition broken in this case. Some might argue that this is a minority use case, so VM-hosted debugging is fine for general use. Personally I don't mind, as long as I have a debugger that can see all the way down. Currently this doesn't include any VM-hosted debugger, but perhaps it could do. (One of my perhaps-future small projects is to create an implementation of JDWP that knows how to answer queries about native code.)

In summary, I think of the solution to re-use proposed by metacircular designs as a degenerate case of the approach I am pursuing. It sounds strange to most people, but it is not too much to ask for a language-agnostic runtime infrastructure that supports a plurality of language implementations (going right down to native code), direct sharing of code and data, and orthogonality of tool support from language. As I ranted about in the VMIL paper, this infrastructure is a modest and very feasible generalisation of what already exists, with basically only performance questions outstanding. (I'm working on it.) Given this infrastructure, the same careful bootstrapping approach can be used to share code and retain tool support throughout higher-level language implementations. But we can do this without the requirement that everything be in a single language, which doesn't make sense anyway.

[/research] permanent link contact

Tue, 20 Dec 2011

Cathedrals, bazaars and research groups

[Post-hoc clarification: at the time I wrote this rather grumbly post, I was working in the Department of Computer Science at the University of Oxford. It doesn't necessarily reflect any on other institution whose domain you might currently be seeing in your address bar!]

A fe months ago I finally got around to watching the video of Guy Steele's “Growing a Language” talk from OOPSLA '98. It's a spectacularly entertaining and insightful talk.

(It's also a nice demo of how a good keynote doesn't have to be Earth-shattering, as long as it's compelling in concept and delivery. Really, the meat of the talk is quite specific: it's about how language evolution should be managed, with particular reference to the then-ongoing attempts to add two features to Java: generic data types, which we all know and love, and operator overloading, which still hasn't made it.)

It was a nice reminder of the two “organisational modes” of collaborative effort that Eric Raymond called The Cathedral and the Bazaar. Building software is one activity where these metaphors apply. Designing languages is another. Research groups are a third.

Like language design and the construction of any large software project (think Linux), research groups aren't a “fully collaborative” activity. Rather, they are “partially collaborative”---it's not that everyone is working with everyone else, but rather, different participants are interested in different pieces of the overall puzzle. There will always be multiple frontiers of progress open concurrently---but all building on a shared central core.

When I was in Cambridge, the group I was in was very much a bazaar in style. There was no unique leader (but rather a gaggle of four or five faculty). Group communications revolve around a mailing list and weekly meetings where discussion was open, informal talks were and anyone would be free to raise questions big and small.

It wasn't a problem-free group, either in general or for me personally. For my first year in the group, the bazaar was dead. That was a tough time---mainly because communication structures reverted to small cathedrals (and I wasn't really a part of any of them). Even later on, I must admit I didn't always feel completely at home. I was a programmer-oriented researcher in a performance- and applications-oriented group. But in hindsight I appreciate that the group's bazaar-like communication structure and ethos were a very good fit for me, even if the topic selection wasn't great. By the end of my PhD, I found I was getting some reward from my participation in the group. in two ways. For one, my work had gained some degree of recognition in the wider group---I felt I had, in my own small way, “shaped the agenda” at least in a tiny corner. (Sadly this was not enough to get others on board with similar work, but also not miles away from that either.) For another, I had participated in the more topic-independent community aspects of a research group---organising the talks for a while, participating in discussions at talks and on the mailing list, being around, organising events, and so on.

I was recently lamenting to myself---a favourite pastime of mine---how right now, my work isn't a “part of” anything. Nobody cares about what I'm doing, or so it seems, and conversely, I find it hard to get enthused about what those around me seem to be doing. But then again, I have very little idea of what their work is, nor they of mine. There is a lack of transparency and a consequent lack of spontaneity. Without cross-linking communication structures, there just aren't the opportunities to spot synergies and find common interests. I have found this a bewilderingly opaque and unsatisfying environment almost since I arrived, but I only recently realised the reason: that it is a severely cathedral-organised group. There is no institutionalised process for cross-talk (such as frequent group meetings or mailing list), and while there are multiple frontiers open, each is coordinated from the top. This clearly works for a lot of people, but not for me. Does that say anything about the kind of researcher I am, or others are?

As an addendum: it's worth briefly mentioning the “agile research groups” idea, one example of which is Scram of Mike Hicks and Jeff Foster. Eric Eide also mentioned to me he uses some of these ideas, to varying degrees of success, in the Flux group at Utah. Coincidentally, I recently dropped in on both these groups! I think these techniques are mostly orthogonal to the cathedral-versus-bazaar issue: they concern the manner (frequency, duration) of communications, not the topology. I expect Scram works best when participants have a common goal, i.e. there may also be tighter topic-coherence requirements on its suitability. These may perhaps even be more likely to hold in a cathedral-style group, although there is certainly no hard-and-fast causal relationship there.

[/research] permanent link contact

Wed, 14 Dec 2011

Heterogeneity or homogeneity: what's the problem?

My attention was recently drawn to a problem that some web developers call the “language heterogeneity problem”. I'm not sure where the term comes from; in fact it is not as widely used as I was led to believe. But still, most people who have done web programming know that there are a lot of languages that people use for the web, not usually out of choice per se, and that this is somehow a problem.

The phrase “language heterogeneity problem” immediately jarred with me, since some of my work has been looking at heterogeneity of language as a goal, not a problem. Surely, we want to choose the best language for each part of our program, and not pay any unnecessary cost when doing so? Of course, the problem is about choice versus imposition. It's not that the ability to use multiple languages is bad. It's that in any given context, we don't have that ability! Consequently, you're forced to use particular languages for a given piece of code. This is the true “heterogeneity problem”. I'd couch the problem as lots of small homogeneity problems, not one big heterogeneity problem.

One of my recent student project ideas, sadly not yet attempted (or indeed advertised), is to develop a compiler back-end and run-time library that would let us compile vanilla C programs into web applications. So, for example, if I do printf() it will write some text to the page, and if I do fgets(..., stdin), it will generate a form field whose submission action is to activate the continuation of the program. There are some interesting extensions to this project. How do we partition a program into its client- and server-side halves, or its client-, server- and database-side thirds? Can we tune the partitioning given a set of requirements for security, interaction latency, client code size, and so on?

(There is also an interesting converse to this problem. Most programming languages' standard libraries are designed around a Unix-like model of I/O. And the first programs we teach---like the Hello World program---use this facility explicitly, by printing to streams and reading from them. But we now live in a world where most familiar forms of computing don't have an obvious terminal- or stream-style of I/O evident in their interface. So perhaps clinging to these examples is creating a barrier in front of potential students---who won't relate to the idea of a program doing I/O through a terminal?)

At SPLASH, I discovered that one chunk of my proposed student project effort has been scooped by Emscripten, an LLVM-to-Javascript compiler. However, since writing such a compiler would be too much work for a single project anyway, this might actually be helpful in enabling a single student project to achieve more working stuff. In other words, they could focus on matters other than the compiler, or on doing interesting domain-specific analyses on user code. Alternatively, perhaps a keen student could try to make their own compiler that does a better job than Emscripten, in some way. Hopefully I will managed to advertise the project in time for the next academic year.

[/research] permanent link contact

Tue, 13 Dec 2011

Load addresses

For reasons I will only hint at, I want to predict the load address of a set of shared libraries, given an executable that links against them. Naturally, I have turned off address space layout randomization.

At first, I thought I could use ldd for this. It seems to work.

$ ldd /usr/local/src/git-
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

But also, there is an environment variable called LD_TRACE_LOADED_OBJECTS that is supposed to have the same effect. As it happens, ldd is just a small wrapper script which sets this variable and invokes the dynamic linker, which on my system is /lib64/ld-linux-x86-64.so.2. Let's try doing this directly.

$ LD_TRACE_LOADED_OBJECTS=1 /usr/local/src/git-
        linux-vdso.so.1 =>  (0x00007ffff7ffb000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7bc4000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff79a7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7608000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7ddc000)

That seems to work too. But wait! It's given us different load addresses than ldd did. Have I really turned off randomization? Well, yes. In fact, repeating either of these commands will reliably yield the output above, and they are reliably different from one another. What is going on?

Let's hack ldd so that it prints exactly what command it is going to execute.

$ ldd /usr/local/src/git-
About to eval:  LD_TRACE_LOADED_OBJECTS=1 LD_WARN= LD_BIND_NOW= LD_LIBRARY_VERSION= LD_VERBOSE= /lib64/ld-linux-x86-64.so.2 /usr/local/src/git-
verify_out is 
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

So, it has set a bunch of other environment variables to empty strings. They look innocuous enough. But also, it is invoking the loader directly, whereas we were just letting execve call the loader for us. Can we reproduce the result of ldd by running the same command it does?

$ LD_TRACE_LOADED_OBJECTS=1 LD_WARN= LD_BIND_NOW= LD_LIBRARY_VERSION= LD_VERBOSE= /lib64/ld-linux-x86-64.so.2 /usr/local/src/git-
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

Yes, we can. Now, the big question: which one is correct? Let's run our program under gdb and inspect the memory map.

$ gdb --args /usr/local/src/git-
... snipped ...
(gdb) break main
Breakpoint 1 at 0x404bf0: file git.c, line 509.
(gdb) run
Starting program: /usr/local/src/git- 
[Thread debugging using libthread_db enabled]

Breakpoint 1, main (argc=1, argv=0x7fffffffddd8) at git.c:509
509     {
(gdb) print getpid()
$1 = 27023
(gdb) shell cat /proc/27023/maps | grep 'lib.*\.so.*'
7ffff7608000-7ffff779d000 r-xp 00000000 08:07 356590                     /lib/x86_64-linux-gnu/libc-2.13.so
... snipped ...

So, libc-2.13.so has been loaded at address 0x7ffff7608000, which is what we got from running with just the LD_TRACE_LOADED_OBJECTS flag set, and not what we got with ldd or with invoking ld-linux.so.2 specially.

Why the difference? Clearly, first execveing the loader perturbs the address assignment. It's not clear why this should be---isn't the loader itself the first thing to be loaded anyway? I'm not yet sure what is going on.

Another question: is predicting the load address even a sound thing to do? Given that we had to disable randomization in the first place, it seems like a bad idea. In my case, this approach will do for now, but ultimately I should defer my work until application start-up time. Then we can discover the actual loaded addresses of the various libraries, which is much more robust.

[/devel] permanent link contact

Mon, 05 Dec 2011

Refactoring refactoring

A little while ago I embarrassed myself in conversation by blurting out a sceptical opinion of refactoring. In this post I'll explain some opinions on refactoring and related research, hopefully in a more considered and coherent manner that I managed on that occasion.

I admit that my inclination to be a bit negative comes from prejudice, with a couple of origins. One is that a while ago, I had to fend off (rhetorical) claims that refactoring solved what my PhD was doing. It clearly didn't, but then again, it was well worth writing an explanation of why not. (These claims were from my supervisor, not my examiners, thankfully, and I think were being advanced rhetorically.) In that context, interface evolution scenarios were the issue. I still contend that refactoring is not the solution to interface evolution. (Endlessly editing code to stay in sync with “one true” “current” version of an interface, whether with or without the help of refactoring tools, is an unnecessarily burdensome approach; an automated assist for editing code doesn't make that editing work- or risk-free.)

More refactoring

Happily, most uses of refactoring are quite different from interface evolution: they're about the internal structure of code, not “edge” interface details. I'm a big fan of refactoring in these cases. As a practitioner I'd love to have more and better refactoring. In that space, one clear improvement would be refactoring tools for more languages. This doesn't mean starting again; most languages are more alike than they are different. At the moment, the popularity of refactoring serves to cement the Java hegemony. This is my other unreasonable prejudice: I dislike this hegemony, and so refactoring culture is tained by association (unfairly) in my mind. It'd be really nice to have some decent refactorings available for C++, but I'm not holding my breath. That said, I might not know about them if they do exist.

(Aside: the real problem with C++ is not pointer arithmetic or lack of garbage collection or lack of type safety or anything else that people usually trot out; it's complexity. I'll rant about that in a future post. By contrast, Java does well because it's a simple language. Actually it does unfairly well because researchers abuse its syntax tree as an intermediate representation for program analyses. I might rant about that in yet another post.)

That's my practitioner's view over with. As a researcher, I have one qualm remaining about refactoring. Rather than doing research on making refactoring work in more and more scenarios, I want to see some exploration of a few bigger ideas in the same general space. There are ideas that are more powerful, more “revolutionary” than refactoring but have (so far) less currency.

Language-independent refactoring

Language-independent refactoring is an obvious goal. The means by which to achieve it is less obvious. A shared metamodel seems sensible. The need for a shared metamodel is arguably a limitation. But I don't buy that argument! My reasoning is based on the observation that most languages have large sets of features that are cognate. By this I mean they are not just theoretically equivalent in what they can express (or perhaps not at all equivalent in that way), but rather, a human user understands them in the same way. (I wish we had the empirical foundations to substantiate that, but that's another matter.) So if we can capture these, we can probably come up with a metamodel that works well in practice, even if it fails for adversarially-constructed cases.

Conversely, just to throw another controversial claim into the mix: languages that have such a pared-down set of primitives that they don't offer a cognate of some recurring feature---like purely functional languages without mutable objects, or C without virtual calls---in practice do have cognates, but appearing as patterns of use rather than delineated language features. So I seem positing some sort of Chomskyan “universal language feature set” that is in programmers' minds if not explicitly in all languages. That feels a bit strong; I'll have to take some time to see whether I agree with myself there.

(As an aside: of course, many languages have too many features, in that they are mutually cognate: do I use static variables, or a singleton object, e.g. in Java? Do I use typeclasses or traits, e.g. in Scala? Do I specialise using template specialisation, overloading or overriding in C++? These self-cognate features usually have arbitrary limitations, and their diversity exists for implementation reasons. Exposing a choice among them leaks implementation details to the programmer, along with consequent performance characteristics. So, forcing the programmer to select among these early on, as these languages all do, is an evil akin to the evil of premature optimisation. Fortunately, it's an evil that refactoring exists to fight!

(Continuing the aside: we perceive “clean” languages to have few mutually cognate features. Conversely, most mutually-cognate features differ along some potentially-separable dimensions: each feature mixes a particular setting on each dimension. For the “static versus singleton”, having a chunk of data that is “one per program instance” is the main concern, and dynamic-versus-static allocation is the orthogonal issue that is unhelpfully mixed with it. In a Java-like implementation, object identity is another mixed concern: it's something you get in the singleton case, not the static field case, and effectively for reasons of implementation leakage. Conversely, in C and C++, statically-allocated stuff can still have its address taken, so there is better separation of concerns in that case.)

Non-behaviour-preserving transformations

Digging deeper than language-independent refactoring, it seems that refactoring's main value is in its ability to improve code by reversing bad decisions that were somehow expedient earlier. But underneath that, there are two cases. Firstly there are cases where you refactor because you got the abstract design wrong earlier (e.g. you assumed there was only one Frob per Widget, and in fact there might be many). Secondly are the cases where you got the abstract design right, but the code-level design wrong, i.e. you didn't map the abstract design optimally onto language features (with respect to maintainability, efficiency, ...). To me, it feels like there is too much emphasis on the second case, while the first one is harder and more interesting.

I think this is because automated refactorings aim to be behaviour-preserving. But since the two problems are very close---they both arise from forced premature commitment and the programmer's failure to anticipate the future---we should perhaps use the same tools to combat both of them. In other words, the requirement that refactorings should be behaviour-preserving actively limits what we can do. So how about some bolder approaches that might sidestep the problem? While these approaches might violate the letter of the definition of refactoring, for me, they retain the most useful charateristic of refactoring: by a single localised change, we can effect global changes on our codebase.

The only work I know that does automated non-local code edits that can change program behaviour is Coccinelle, based on the “semantic patch” idea. Aspect-oriented programming is a similar technique, but works by effectively (and controversially) delocalising run-time semantics rather than performing non-local code edits. I'd like to know if there are others more like Coccinelle already in existence.

So, suppose we discard the restriction of limiting ourselves to behaviour-preserving edits. One direction under this auspice is to creep closer towards model-driven development. I want the ability to change my “model” (either in my head, or in some modelling notation) and see changes reflected in source code. And also, vice-versa: if we do have a modelling notation, code changes should update the model. This is a hard but interesting problem in bidirectional transformation, which has something of a currency at the moment (witness the BX workshop).

Logic metaprogramming

A final thought is about logic metaprogramming. This is a very cool idea that I have not yet got up to speed on. In fact, almost all I know about it is from the abstract of a paper I saw at SPLASH last year, which said: “In logic metaprogramming, programs are... derived from a deductive database.” But this one sentence is so intriguing that I want to run for a while with what I think it might entail, before I find out what is actually done in existing systems (of which there are few!).

I've often wanted to program not by writing code directly---since I'm often aware that the code I'm writing will probably turn out “wrong” or “bad” once I've done a bunch more coding---but by making a sequence of simpler statements that I have more confidence in. Each statement should be small, freestanding and less of a commitment than writing a line of code would be. These statements might be such that none of them, by itself confers enough information to write a "known good" piece of source code. E.g. I might write that each instance of class A “has a[n]” associated instance of class B, but I don't yet know whether this association should be expressed as pointers, or by some associative data structure, say. This decision could be determined later, by solving constraits originated by other small statements. Ties could be broken (i.e. multiple candidate solutions selected among) by extrafunctional requirements such as performance (which might favour pointers over associative structures).

This is related to program synthesis and refinement methodologies, I guess. But I am particularly interested in making it exploratory. By having a tool explore the implications of the programmer's statements, we can potentially refine our understanding of the problem (a.k.a. “debug the design”) without going through the circuitous path of first writing some “bad” code and then either finding it's not the right design (and cleaning it up by heavyweight code changes) or finding it's incidentally messy (and cleaning it up, just by automatic refactoring if we're lucky). We can also have a tool tell us what the “right way” to code something is, but early. If the only solution to a particular set of requirements is to use a particular language feature, then the tool can tell us this, rather than letting us find it out by making the wrong choice and then backtracking. Of course, we need to get the requirements right up front, so this technique will only ever be a complement to more backtracking-oriented techniques.

Multi-dimensional representations of software

It is a very classical notion that programs have one true form, being their source code in the form of a string of symbols. Refactoring sticks with this idea but tries to make it easier to alter that form, by abstracting and automating certain common complex non-local edit patterns. But we can go further by rejecting the notion of “one true form” altogether, at least in the sense that that form is manipulated by programmers.

Of course, this is the MDSoC holy grail. I think the idea is just slightly too big for its own good, at present. Ironically, or fittingly, it has not been decomposed properly: aspects, refactoring and typeclasses are the main programming weapons that share its spirit, but none has its power or elegance. It's a shame that work on the idea seems to have fizzled out. (It's also a shame that the paper isn't on more reading lists!)

Somewhat relatedly, there's been some interesting work on subjective/transient/dual encodings of language features, as with the registration-based stuff at last year's Onward!, or Rob Ennals' Jekyll. But I'm still not aware of any mature tools that can really rip apart the modular structure of code and transform it on demand. Perhaps one problem is that we need to be able to define what primitive entities these queries “select”, and how they reformulate them into the appropriate bigger chunks---ideally in a language-agnostic way. So it's back to the shared metamodel. Again, better understanding of “cognate” language features, and indeed of less intuitive correspondences between language features (like the nontrivial correspondences between algebraic data types and class hierarchies), will help here.

[/research] permanent link contact

Guided by folklore

In a recent chat with my internal examiner, Andy Rice, I had a few thoughts which I decided to write down. It turns out he reads my blog---along with (his words) “everyone in the department” ---so, hi Andy and everyone. One day I might stop writing as if my audience consists only of myself, but not right now.

In summary, I want to rant about two weird things that go on in the research world. One is that there are some memes that seem to have a real influence on how PhDs are examined, but seem to have no origin other than folklore, and are different from the standards used to judge other research. The second rant, and perhaps the more oft-repeated, is that we actively encourage boring research.

(I should add that although this post is rather ranty, the chat was not an argumentative one. So, this is mostly post-hoc ranting about related topics, and not a direct reflection of our conversation.)

A thesis is judged on criteria from folklore, beyond what applies to “normal” research. At various points in my PhD, I heard it said that “a thesis should... [do X]”. Usually, X was something to do with telling a complete story, strongly substantiating a succinct hypothesis, and so on. And now I have heard the same from my examiners. Unfortunately, these statements continue to be just that---hearsay. They're different from the ways in which other research is judged. There are no regulations or official guidance to support them. There are no clear scientific or moral justifications for them either. The research community happily publishes many papers that do not tick these boxes, and at good venues. My own OOPSLA '10 paper is one example, but there are lots of others. But despite this, PhD examination seems to give a lot of currency to these criteria, for apparently no reason other than their having been handed down through the generations.

During my PhD I didn't worry myself much about this, since, like most researchers, I don't put much weight on unsourced claims. Besides, there seemed to be enough data downplaying their significance anyhow---several other theses seemed to break the rules, and plenty of published, respected research papers did too. Surely if a PhD is training for research, the qualifying criterion should be focused on doing good research? From my very limited experience, and from what I gather from listening to others, this is not how things currently are. Fortunately, I am of the bloody-minded type. I was aware that I might be “creating trouble” for myself, but I personally preferred to risk creating that trouble, thereby at least gathering some evidence about it, rather than swerving to avoid an obstacle that was at best nonexistent (I didn't know it would cause trouble) and at worst, worth challenging. So, consider it challenged! If you think a thesis needs to be anything more or different than good research, I challenge you to justify that position.

Now, on to my second rant. The evaluability problem has an irrational hold on many practical computer scientists, to the extent that research into many important problems is deliberately avoided. I spoke to many experienced researchers about my PhD work as it went along. Several of them suggested that I might have some trouble at examination. This seemed odd to me, for the reasons I just ranted about. Nevertheless, I didn't disbelieve them. But I had no intention of applying the fix they suggested. Rather than developing an alternative evaluation strategy or (the best advice in hindsight) to maximise the persuasiveness of the presentation of whatever evaluation data I did have, the only “advice” I ever received on this point was a not-so-veiled encouragement to abandon my current problem and work on something else. “Up and to the right” was what one researcher told me---about the kind of graph that should be in my evaluation chapter. (My evaluation chapter has no graphs, and is staying that way.)

This attitude is the tail wagging the dog. If a problem is important, and we do some research that is not conclusive, we should damn well work harder at it, not give up. The problems and curiosities of humankind are not regulated by how easy it is to collect data and draw graphs about them. If we avoid working on important but difficult-to-evaluate problems, or discourage such work, it shows the worst kind of ivory tower mentality. It is far from a pragmatic position, despite how (I'm sure) many of its adopters would try to spin it. What is pragmatic about ignoring the real problems?

I'm not downplaying the importance of evaluation. It goes without saying that measuring the value of innovations is important. Moreover, our ability to measure is something we need to work on actively. After all, many of those physicists and other “hard” scientists seem to spend nearly all their time working out ways of measuring stuff. So I'm completely in favour of rigorous evaluation. On the other hand, I'm not sure that a lot of evaluation that currently passes muster is really rigorous anyway. We need to recognise evaluation as a problem in its own right, whose hardness varies with the problem---and make allowances for that. For many hard problems, evaluation of a solution is comparably hard. That shouldn't mean that we give up any attempt to tackle those problems. The preference for conclusive results in published research has a deceptive influence, being essentially the same phenomenon as the “decline effect”, described in this very interesting article from the New Yorker.

There are some other problems with evaluation in particular kinds of CS research. One is what I call “evaluation by irrelevant measurement”: if you develop something that is supposed to help programmers, but you can't measure that, how about measuring its performance or proving its type-soundness? It says nothing about whether you've achieved your goals, but it still ticks those evaluation boxes. And of course we have a big problem with reproducibility of experimental results---at the VMIL workshop at SPLASH, Yossi Gil gave a great talk about the non-reproducibility of VM-based microbenchmarks, and Jeremy Singer's Literate experimentation manifesto was a nice counterblast to the wider problem.

I have found programming language researchers to be more sympathetic than “systems” researchers to work “towards” a goal, as distinct from work telling a complete story about some problem. This is partly because the nature of programming language research makes reliable evaluation a very high-latency endeavour. In other words, until real programmers have used your idea in a large number of projects, there will be no clear experience about how well it works. So, being computer scientists, we mitigate that latency, using pipelining. Rather than a slow stop-and-forward algorithm which waits 20 years between research projects, we have to be more amenable to two approaches: argument, in the sense of paying attention to the reasoning that justifies the approach of a particular piece of work, and speculation, meaning allowing the research discourse to explore many alternative approaches concurrently, and letting time tell which ones will “stick” out of the many that have been given a chance. The job of the researcher is less to conclusively show a problem as solved, but to show that a technique is feasible and has some potential for wide and successful application.

Going back to the first point, perhaps I should add that I'm not saying that my thesis would have stood up any more strongly by “good research” criteria. But having said that, a very large chunk of it appeared at a top-tier venue, so it can't be all that bad. Both of my examiners seemed to miss this fact, so the lesson is: always put a prominent summary of your publications in your thesis! Personally I can be very critical of my thesis work. But it seems bizarre to me that folklore should have so much sway in the way that theses are examined.

[/research] permanent link contact

Thu, 01 Dec 2011

Weak dynamic symbols

Although I know more than the average bear about linkers, there's always things I don't know. Until now I never had cause to understand the following: how does the linker know which symbols to link dynamically, and which to link statically?

A bit of head-scratching reveals the only possible answer. Given a command-line, it does the usual thing: look at the command-line options, perform the standard library look-up procedure, gathering a list of object files---some static, some dynamic. If a symbol is defined by a dynamic library, make it dynamic. Otherwise, it stays static.

That sounds fairly sensible. But it can mean surprises. Suppose you have a C program that wants to support an optional feature to be linked in at load time. You might write something like the following.
int optional_function(int arg1, void *arg2) __attribute__((weak));

/* ... */

void do_something(void)
    if (optional_function) optional_function(42, &some_obj);
    /* else skip the optional part... */

If you pull this sort of trick within a shared library, it works fine. But inside an executable: no dice! If you compile this into an executable and look for optional_function in your dynamic symbol table, you'll be disappointed.

$ objdump -T my-program | grep optional_function

What is going on? Well, it's in the static symbol table, silly.

$ objdump -t my-program | grep optional_function
0000000000000000  w      *UND*  0000000000000000          optional_function

What does it mean to have an undefined symbol in your executable's static symbol table? It means it will silently take the value zero! In fact, the relocation records referencing your symbol have already been discarded.

$ objdump -rRd my-program | grep -A1 -B1 callq
  400549:      bf 2a 00 00 00        mov    $0x2a,%edi
  40054e:      e8 ad fa bf ff        callq  0 <__init_array_end>
  400553:      b8 00 00 00 00        mov    $0x0,%eax

Cheerily, the linker has inserted a direct-bound call to address zero in your code. That's not what we want! So, how can we fix it?

The trick is in the linker's (or at least the GNU linker's) --dynamic-list option. First, create a file called whatever you like (mine's called dynamic-list), containing the following.

{ optional_function; };

Now link your program passing --dynamic-list <your-dynamic-list> to the linker.

gcc -Wl,--dynamic-list -Wl,<your-dynamic-list> -o my-program my-program.c

Hey presto! You should now have your weak symbol in the dynamic symbol table.

$ objdump -t my-program | grep optional_function
0000000000000000  w   D  *UND*  0000000000000000          optional_function

That's a bit ugly. Recalling the linker behaviour I described at the beginning, the simpler way to do it is just to link your executable against a shared library defining optional_function.

You might wonder (as I do): what is the point of putting undefined symbols in an executable's static symbol table? Once the executable is output, it's too late to link anything with them. Surely they should all be “promoted” to dynamic symbols? [Update, 2012-5-19: there is of course a linker option for doing this, which in the GNU case is --export-dynamic. Still, I'm not sure why it isn't the default.]

It would also be nice to have an objcopy option for adding dynamic symbols in this way, so we can do it after the fact, rather than changing the linker command like we did above. However, this is nontrivial for the reason I mentioned---the relocation records that you would want have already been eliminated. So, we would need to re-create them. This is similar to something I began work on before. At some point I might resurrect my objcopy patches and try to repurpose them to this problem. For now, I will just hack in the extra linker options.

[/devel] permanent link contact

Powered by blosxom

validate this page