Rambles around computer science

Diverting trains of thought, wasting precious time

Fri, 13 Jun 2014

Linking and loading: what's incidental?

Not long ago I gave a couple of talks about linking, loading and debugging here in Cambridge. It's a notoriously grungy area of our infrastructure, and a recurring question people asked me after the talks was as follows: what's the incidental complexity and what's the intrinsic complexity? Put differently: how might we redesign the whole lot nowadays, to get a less complex end result?

It's hard to answer, and partly it depends on what you value. On the surface, it seems like linking should be conceptually simple. Digging deeper, though, it becomes apparent that there are a lot of concerns involved, and they are often in tension, making it very tricky for any simple solution to satisfy all requirements. That doesn't mean that simpler solutions are not possible, but we should not underestimate the intrinsic complexity either.

What intrinsic complexity am I talking about? I can see four concerns central to the design of ELF dynamic linking. One is link speed: linking should be as fast as possible. Another is memory sharing: shared libraries were introduced (in part) to save memory. Another is flexibility, achieved through interposability: it should be possible to customise the link to add or replace features without modifying binaries. The last is federation: ELF is shared by many vendors' toolchains and many OSes, so the ELF runtime necessarily has a descriptive role, which is the basis for cross-toolchain runtime protocols for stack walking, exception handling and the like.

Part of the problem is that many people don't value all four of these concerns simultaneously. Link speed is valued by software distributors (who want to cut down startup times) and people who use embedded platforms (who can't afford overheads), but not workaday developers or, directly, users. Memory sharing, achieved through position independence, is valued by people who run the kind of systems where it saves memory, like desktop systems, and not by those who don't, like most servers. (Plan 9's decision not to bother with shared libraries is both reasonable, if you run mostly servers, and unreasonable, if you run mostly large desktop applications.) Interposability is valued by people who write tools and those with unusual requirements—which often entail reimplementing selected parts of core libraries (think malloc())—but not by those who don't. Federation is valued by those who value “openness”—competition among vendors, interoperation of tools, multi-language programming, and so on—but not by those who are comfortable with closed platforms, VM-style designs and/or “one true language”.

Few people care about all four simultaneously, so linking seems overcomplicated to everyone. But all these concerns add to complexity individually, and the tension between them increases it further. Link speed concerns create incentives for doing more work in the compiler or compile-time linker. This creates “premature optimisation” phenomena which complicate linker implementations (by adding to the variety of relocations) and also complicate the user-facing parts of the linker (such as getting the link options right). Sharing memory entails position independence; this creates complexity for the compiler author, who must support multiple code models, and for the user, who must grok the related compiler options. Interposition creates complexity for the linker author and user, and costs performance, since it prevents some binding happening earlier than it otherwise could. Federation adds complexity for the compiler author, who must take care to follow conventions correctly and emit metadata such as unwind information and debugging information. It also brings some cost at run time: a cross-language, cross-vendor exception protocol is slower than a custom one, for example.

What design decisions can we rethink to get a system that satisfies all these requirements but with less incidental complexity? I believe that a smarter linker could help. The Unix-like linker design, as evolved to elaborate formats such as ELF, can be described as “smart format, dumb linker”. A key “dumbness” of the linker is that it does not know about instruction streams. Rather, the linking format describes all the ways that relocations can be applied as simple arithmetic on byte sequences. The linker's output is a mostly-deterministic function of its inputs (including linker scripts).
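To make that concrete, here is a minimal sketch (in C) of the dumb linker's view of a relocation. The struct and function names are invented for illustration, and real ELF relocations arrive as Elf64_Rela records, but the arithmetic is the familiar S + A and S + A - P of the psABI documents: the linker patches bytes without knowing what instruction, if any, surrounds them.

    /* Sketch only: invented names; real relocations come from Elf64_Rela records. */
    #include <stdint.h>
    #include <string.h>

    enum reloc_kind { ABS64, PCREL32 };     /* cf. R_X86_64_64, R_X86_64_PC32 */

    struct reloc {
        enum reloc_kind kind;
        uint64_t offset;   /* where in the output image to patch             */
        int64_t  addend;   /* A: constant from the relocation record         */
        uint64_t symval;   /* S: resolved address of the target symbol       */
    };

    /* 'image' is the output being produced; 'base' is its eventual load
     * address, so P, the run-time address of the patched bytes, is
     * base + offset.                                                        */
    static void apply_reloc(uint8_t *image, uint64_t base, const struct reloc *r)
    {
        uint8_t *where = image + r->offset;
        switch (r->kind) {
        case ABS64: {
            uint64_t value = r->symval + r->addend;             /* S + A     */
            memcpy(where, &value, sizeof value);
            break;
        }
        case PCREL32: {
            int32_t value = (int32_t)(r->symval + r->addend
                                      - (base + r->offset));    /* S + A - P */
            memcpy(where, &value, sizeof value);
            break;
        }
        }
    }

Everything the linker needs is in the relocation record and the symbol table; the bytes being patched are otherwise opaque to it.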

If we had a smart linker, we could have slightly dumber compilers—a win for simplicity, since there are many more compiler implementations than linker implementations. We could also have less user-facing complexity: the compiler could just output code according to the slowest but most flexible code model, and the linker could take care of optimisations at the instruction level, including position independence. The key difference is that such a linker must be able to parse and rewrite instruction streams. Current designs have been crafted to avoid this, but I believe it is a good fit for linking and not the burden it first appears to be.
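As a sketch of what "understanding instruction streams" buys, here is roughly what such a linker might do to a GOT-indirect load once it knows the symbol binds locally: rewrite it into a direct lea. This particular peephole is essentially what GNU linkers already do for the x86-64 R_X86_64_REX_GOTPCRELX relaxation; the structure and names below are invented, and a real implementation handles many more instruction patterns.

    #include <stdint.h>
    #include <string.h>

    struct code_site {
        uint8_t *insn;       /* points at the REX prefix of the instruction   */
        uint64_t insn_addr;  /* its address in the output image               */
        uint64_t sym_addr;   /* resolved address of the symbol it refers to   */
    };

    /* Rewrite "mov reg, [rip+disp32]" (48 8b /r, loading a GOT slot) into
     * "lea reg, [rip+disp32]" (48 8d /r) that addresses the symbol directly. */
    static int relax_got_load(struct code_site *c)
    {
        if (c->insn[0] != 0x48 || c->insn[1] != 0x8b)
            return 0;                            /* not the pattern we handle  */

        /* The displacement is relative to the end of the 7-byte instruction. */
        int64_t disp = (int64_t)(c->sym_addr - (c->insn_addr + 7));
        if (disp < INT32_MIN || disp > INT32_MAX)
            return 0;                            /* too far away: keep the GOT */

        c->insn[1] = 0x8d;                       /* mov -> lea                 */
        int32_t d32 = (int32_t)disp;
        memcpy(c->insn + 3, &d32, sizeof d32);   /* patch the displacement     */
        return 1;
    }

The point is not this particular rewrite, but that once the linker can parse and regenerate instructions, the compiler no longer needs to guess which code model the eventual link will want.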

A smarter linker needn't yield slower programs than a dumb one; in fact it could speed them up. Link-time optimisation is rarely used at present, and not at all across dynamic linking. Meanwhile, interposition has nontrivial run-time costs in some cases, since symbol bindings must be computed at load time and run time. A smarter linker could use its awareness of instruction streams to deploy a variety of strategies to speculatively bind, cache and unbind references, both in disk images (amounting to a less ad-hoc prelink) and in memory (e.g. inlining calls that currently go through the PLT). The former can be thought of as link-time deoptimisation: we store on disk a binary that has been adaptively optimised for the common case on the particular system, and if an unusual link requirement comes along, the linker produces a more flexible, deoptimised version. The idea of “linkers as servers” has been explored before (e.g. in the OMOS linker, from Utah) but not in this particular way.
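To illustrate the in-memory side under the same caveats (invented names, and a real implementation would have to make the text page writable with mprotect() and worry about threads), here is a sketch that speculatively binds a "call foo@plt" site straight to its resolved target, and can later unbind it if, say, an interposing definition turns up.

    #include <stdint.h>
    #include <string.h>

    struct call_site {
        uint8_t *insn;        /* address of the 0xe8 (call rel32) opcode byte */
        int32_t  saved_disp;  /* original displacement, kept for unbinding    */
    };

    /* Patch the call so it jumps straight to 'target', bypassing the PLT.    */
    static int bind_call(struct call_site *c, const uint8_t *target)
    {
        if (c->insn[0] != 0xe8)
            return 0;                              /* not a direct call        */
        memcpy(&c->saved_disp, c->insn + 1, 4);    /* remember the PLT disp    */

        /* rel32 is measured from the end of the 5-byte call instruction.     */
        int64_t disp = (int64_t)((intptr_t)target - (intptr_t)(c->insn + 5));
        if (disp < INT32_MIN || disp > INT32_MAX)
            return 0;
        int32_t d32 = (int32_t)disp;
        memcpy(c->insn + 1, &d32, sizeof d32);
        return 1;
    }

    /* "Deoptimise": restore the original call through the PLT.               */
    static void unbind_call(struct call_site *c)
    {
        memcpy(c->insn + 1, &c->saved_disp, sizeof c->saved_disp);
    }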

Compatibility concerns have also greatly complicated linking. Before shared libraries, things were fairly simple: to link against the C library, you linked with -lc, which was just a bunch of object code in an archive. Contrast that with now. Shared libraries have been added in a “compatible” way that sits alongside, but doesn't replace, archives. As a result, we use both, and get double the complexity. For example, on GNU platforms we mostly use shared libraries, but libc_nonshared.a, -lgcc and friends complicate all that. Since “compatible” shared libraries could not quite offer a complete replacement for old-style linking, we have bifurcated the previously clean static linking model into a mostly-shared-but-partly-static model. A smarter linker could internalise the goal of code sharing. By rewriting instruction streams at load time, any compiled code is potentially shareable, and it's the linker's job to extract an adequate degree of sharing. There's no need for the user-facing complexity of distinguishing shared from non-shared linking.

Shared libraries themselves must bring some added complexity, since being (logically) shareable is a stronger requirement on a chunk of code than having it work only in a single context. In particular, robust shared libraries require symbol versioning or some other notion of “semantics”, to account for the inevitable evolution of software. Current designs for symbol versioning are themselves a great example of unnecessary complexity. Again, they were designed to be a transparent extension to what already existed—versionless shared libraries. This has yielded an overly circuitous solution. Pushing symbol versions into the symbol name pushes a lot of complexity into other contexts. Most irritatingly, clients of dlsym() are essentially broken by it, because they must deal with symbol names munged to include versions (or must switch to using dlvsym()). The “exact match” semantics offered by dlsym() no longer offer the right abstraction when looking up symbols provided by versioned libraries. But this solution was accepted because it didn't syntactically break anything (and semantics are the user's problem, right?). The ideal solution would look up symbols not (just) by name but also (partly) by their semantics.
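For the curious, the dlsym()/dlvsym() wrinkle looks roughly like the sketch below. dlvsym() is a GNU extension, and the version strings shown are just the ones that happen to apply to memcpy on x86-64 GNU/Linux (where the default is GLIBC_2.14 and an older GLIBC_2.2.5 definition also exists); on other platforms the strings, and even the set of versioned symbols, differ.

    #define _GNU_SOURCE             /* for dlvsym() */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *h = dlopen("libc.so.6", RTLD_NOW);
        if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        /* dlsym() hands back whichever version the library marks as the
         * default; the caller has no way to say which semantics it wanted. */
        void *dflt = dlsym(h, "memcpy");

        /* dlvsym() lets the caller pick, but only by naming an exact,
         * platform-specific version string.                                */
        void *old = dlvsym(h, "memcpy", "GLIBC_2.2.5");

        printf("default memcpy at %p, GLIBC_2.2.5 memcpy at %p\n", dflt, old);
        dlclose(h);
        return 0;
    }

Neither interface lets the caller describe the semantics it actually wants; it can only match strings, which is precisely the circuitousness complained about above.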

Interposition is also arguably not a fully realised idea. One problem is that not every symbol is interposable, because of the static/dynamic split I mentioned earlier: some programs, and some parts of most programs, are still linked statically. Moreover, “interposition misses” easily occur, where the interposer doesn't interpose on all the def–use edges it's supposed to. For example, in glibc, many library-internal calls to low-level functions like mmap() don't go via the publicly-exposed symbol. Calls that appear to be to open() might actually be linked to a mysterious alias of that symbol, such as __open_2(). Although interposition is still useful, the possibility of “missed” calls and the need to grok undocumented binary interfaces like __open_2() make interposition-based software fragile and expensive to maintain. This could be fixed by a more uniform dynamic linking model and a more semantically founded notion of interposition. One that might work would be interposition based on code paths, not entry points. Again, implementing such a notion inside a linker would require that linker to understand instruction streams and the control flows they allow. When we ask to interpose on a (function) symbol, what we're really asking for is something more like inserting a new call-graph node that dominates all those in the interposed-on function's call graph. (It's a bit more complicated than that, since this alone doesn't solve the glibc problem I mentioned, but the basic idea can likely be extended to yield the interposition semantics we really want.)
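For reference, the entry-point interposition being critiqued here is the kind set up by LD_PRELOAD, along the lines of the simplistic sketch below; the “misses” arise because glibc-internal callers, and code compiled to use aliases such as __open_2(), never reach a wrapper like this.

    #define _GNU_SOURCE             /* for RTLD_NEXT */
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>

    /* Interpose on open() by defining it ourselves and forwarding to the
     * "next" definition in link order, i.e. the one being interposed on.  */
    int open(const char *path, int flags, ...)
    {
        int (*real_open)(const char *, int, ...) =
            (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {      /* open() takes a mode only when creating */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        fprintf(stderr, "open(\"%s\", %#x)\n", path, (unsigned) flags);
        return real_open(path, flags, mode);
    }

Built as a shared object and injected with LD_PRELOAD, this sees the application's open() calls and most libraries' too, but not the missed ones just described.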

Hardware complexity also leaks into linkers. The root cause of complexity in relocation and code models is the diversity of addressing modes offered by hardware. Linker authors and others working at the toolchain level tend to take on trust the good sense of the hardware designers' decisions. Whether this complexity is intrinsic or incidental depends on whether we allow ourselves to rethink the hardware too. For the moment I'm going to take this diversity as a hard constraint, but I'd welcome views from more hardware-minded people.

All this is not to say that old-fashioned complexity creep isn't behind its share of problems. Debugging information is a great example of “limb growth” and is crying out for a cleaner design, ideally one that helps reduce the burden on compiler authors and/or improve the accuracy of the debugging information they generate. For linking proper, though, the old-fashioned Unix model of “dumb linking” has stayed relatively clean. It's been keeping that model, in the face of new requirements such as shared libraries, versioning and so on, that has complicated things. I believe it's no longer the right model. Luckily, we can produce a smarter linker that is nevertheless a drop-in replacement for the standard one, so we have an evolutionary path towards something better.
