Rambles around computer science

Diverting trains of thought, wasting precious time

Tue, 02 Jan 2024

How to “make” a shell script

This post is about the following strange program I wrote recently.

dummy=; define () { true; }
define dummy
echo "Hello from shell; PATH is ${PATH}"
return 0 2>/dev/null || exit 0
endef

.PHONY: say-hello
say-hello:
    @echo "Hello from make; makevar PATH is $(PATH), envvar PATH is $${PATH}"
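To see both faces in action, here's a quick demo (the file name hello.mk is my choice, and note that in a real makefile the recipe line must be indented with a literal tab):

```shell
# write the two-faced file out, then run each half
cat > hello.mk <<'EOF'
dummy=; define () { true; }
define dummy
echo "Hello from shell; PATH is ${PATH}"
return 0 2>/dev/null || exit 0
endef

.PHONY: say-hello
say-hello:
	@echo "Hello from make; makevar PATH is $(PATH), envvar PATH is $${PATH}"
EOF
sh hello.mk                  # the shell stops at the return-or-exit line
make -f hello.mk say-hello   # make skips over the define...endef body
```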

One of my guilty secrets is that ‘make’ is one of my favourite programming languages. It has persistence and goal-driven incremental execution... so high-level! Although it is string-based, it has much less esoteric quoting behaviour than the shell.

Another guilty secret is that I also write a lot of shell scripts. Despite not really liking the shell as a language, I've grown used to its quirks. When the filesystem is your database, the shell is an essential tool.

A third not-so-guilty secret is that I believe in the power of the one-filer: a piece of software I can deploy as a single short-ish file. The above program is a demo of how to combine shell and ‘make’ code in the same file, increasing the space of one-file programs you can write.

Of course, ‘make’ already embeds the shell, but in a very limited way: recipes embed single-line shell commands within them. Normally you can't (usefully) define shell functions within ‘make’, for example. But now, thanks to my handy trick, you can! Whereas normally you'd have to put them in a separate file that you source from some shell commands in the makefile, the “separate sourced file” can now be the same file as the makefile itself. That's just one benefit of having shell-and-‘make’ code bundled together.

This implied use of source is why the shell code ends as it does: if we're being sourced we want to return rather than exit. Since return will provoke a complaint and an error status if we're not being sourced, we silence the complaint and fall back to exit.
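A minimal sketch of this ending, separated out from the ‘make’ trickery (file name is my own invention):

```shell
# a script whose last effective line is the return-or-exit idiom
cat > ending-demo.sh <<'EOF'
echo "doing my thing"
return 0 2>/dev/null || exit 0
echo "never reached"
EOF
sh ending-demo.sh                                 # executed: return fails quietly, so exit 0 runs
( . ./ending-demo.sh; echo "caller continues" )   # sourced: return 0 hands control back to the caller
```

Either way, execution stops before the final line, but only the sourcing caller gets control back.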

Why does all this work? The shell is more-or-less line-oriented: it reads your script incrementally and will not read further in the file than it needs to. So, the file itself can easily contain non-script data at the end, following an exit command. The self-extracting shell script is a well-known trick exploiting this fact.
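For example, a minimal self-extracting script might look like this (marker and file names are my own):

```shell
cat > selfextract.sh <<'EOF'
#!/bin/sh
# copy everything after the __ARCHIVE__ marker line into a payload file;
# sed reads the whole file, but the shell's own parsing stops at "exit 0"
sed -e '1,/^__ARCHIVE__$/d' "$0" > payload.txt
exit 0
__ARCHIVE__
this line is data, not shell code; execution never gets this far
EOF
sh selfextract.sh
cat payload.txt
```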

‘make’, however, as an interpreter is file-oriented: it will read a whole makefile before it does anything. That would seem to be a problem if we want to embed non-‘make’ content in a makefile. However, there is a way.

The first three lines consist of punny shell-or-make code: each means something different, but valid, to both the shell and ‘make’. These meanings are such that ‘make’ will ignore the shell code and the shell will ignore the ‘make’ code. The main trick is use of the multi-line lazy “define” feature of ‘make’, which wraps around the shell code. Even if the shell code contains things that look like make expansions, the laziness ensures they will not get evaluated. Rather, ‘make’ just scans forward looking for “endef”, which appears only after the shell has exited.

Conversely, to allow the shell to pass over this “define” decoration harmlessly, we create a function called define (supplanting any command of the same name that might be on the PATH). The ‘;’ trick exploits how

blah=x; something

associates differently in ‘make’ versus the shell. In ‘make’ the part ‘; something’ is included in the string that is assigned, since semicolons are not a terminator. In the shell they are, allowing our something to be a definition of a no-op function called define, which we use to make the shell skip over the define intended for ‘make’. Similarly, define in ‘make’ is a multi-line construct and the body is expanded only lazily (i.e. not at all, since we don't use it). We use this to skip over an arbitrary number of shell lines.
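The shell's half of the pun is easy to check on its own (the ‘make’ half needs make itself, which reads that first line as assigning the whole string ‘; define () { true; }’ to dummy):

```shell
dummy=; define () { true; }   # to the shell: empty assignment, then a no-op function definition
define dummy                  # calls our no-op function; harmless
echo "shell carried on; dummy='${dummy}'"
# prints: shell carried on; dummy=''
```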

Before landing on this solution, I tried various things that didn't work. These included variations using the form $(eval ...), which is syntactically valid in both ‘make’ and the shell but has slightly different semantics in each. It didn't work because ‘make’'s eval is tricksy. Unlike shell expansion, we can't use it to inject arbitrary strings or even arbitrary token sequences; it seems only to work when it generates a whole phrase of valid ‘make’ syntax. This means that we can't, for example, have the shell skip over the define simply by having define be expanded from an application of ‘make’'s eval that expands to empty in the shell.

Why do I want all this? My motivation was writing moderately simple web applications that could be hosted easily on a machine where I have only basic privileges. If I can run CGI scripts, I can use the shell and the filesystem. I may not be able to (or want to) run PHP or similar. And the same goes for databases... even if I could run a database, they're a lot less useful than files in our current (impoverished) software ecosystem. Programs have to be written specially to interface with a database, but files have the nice property that most programs know how to access them. My application's data doesn't need to be “exported” to them, which anyway brings the complexity of copying. Instead I can have direct access to files that my web application also accesses. My demo application is a form-filling system a bit like Doodle, but called Doddle. Instead of a database, I have files. Instead of PHP, I have the shell. No special privileges are needed, beyond the ability to run CGI programs. More about that anon....
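To make that concrete, here's a hedged sketch (names and layout invented) of the sort of thing I mean: a CGI program that is just a shell script, with a plain file as its datastore.

```shell
#!/bin/sh
# minimal shell CGI: the web server execs us and relays our stdout to the client
printf 'Content-Type: text/plain\r\n\r\n'
# the filesystem is the database: append one line per visit, then report
echo "$(date -u) ${REMOTE_ADDR:-unknown}" >> visits.txt
echo "Hello! Visits so far: $(wc -l < visits.txt)"
```

The visits.txt file is readable by any other program, shell pipeline or human, with no “export” step.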

[/devel] permanent link contact

Mon, 11 Sep 2023

A process by any other name

On Linux, what is a process's name? I've found this to be more complicated than I thought it would be. Clearly it's non-trivial because it seems top and ps can disagree about a process's name. As usual I have Firefox and some of its worker processes cheerfully burning some CPU for no apparent reason, so let's use one of those.

top - 14:53:48 up 5 days,  3:20, 28 users,  load average: 0.47, 0.82, 0.74
Tasks: 629 total,   1 running, 543 sleeping,   0 stopped,   4 zombie
%Cpu(s):  1.9 us,  0.8 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16064256 total, 15800628 used,   263628 free,    55752 buffers
KiB Swap:  9291780 total,  6922152 used,  2369628 free.  5019028 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                      
 3172 stephen   20   0 2493792 145184  91804 S   6.0  0.9   0:07.80 Isolated Web Co                                              
 1685 stephen   20   0 2693232 264100  95080 S   3.6  1.6   0:28.86 Isolated Web Co                                              
19803 stephen   20   0 13.552g 6.042g 2.603g S   2.6 39.4  37:22.35 firefox-esr                                                  
12266 stephen   20   0 2653868 670136 582816 S   1.7  4.2 103:31.80 Xorg                                                         
...

Here it looks like process 3172 has a name ‘Isolated Web Co’. But this isn't the name that shows up in ps xf.

$ ps xf | grep [3]172
 3172 pts/2    Sl     0:09  |   |               \_ /usr/lib/firefox-esr/firefox-esr -contentproc -childID 218 -isForBrowser -prefsLen 39352 -prefMapSize 227223 -jsInitLen 277276 -parentBuildID 20230403141754 -appDir /usr/lib/firefox-esr/browser 19803 true tab

Nor is it what is visible in /proc/pid/cmdline.

$ cat /proc/3172/cmdline | tr '\0' '\n'
/usr/lib/firefox-esr/firefox-esr
-contentproc
-childID
218
...

In fact we can get top to flip between these views by pressing the ‘c’ key.

       -c  :Command-line/Program-name toggle
            Starts  top  with  the last remembered 'c' state reversed.  Thus, if top was dis-
            playing command lines, now that field will show program names,  and  visa  versa.
            See the 'c' interactive command for additional information.

So the process name is not the same as its argv[0], and in our case it is the string ‘Isolated Web Co’. I'd like to know where this string is stored. It seems this is called the “program name”, at least by the top manual page. Later in the manual page it talks about “command names” as the same thing as “program names” and observes that even a process without a command line, like kthreadd, has a program name. So where does it live? The /proc filesystem does know about these names.

$ cat /proc/3172/status | head
Name:   Isolated Web Co
Umask:  0022
State:  S (sleeping)
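This split is easy to reproduce with a throwaway process; a sketch, using sleep as the victim:

```shell
sleep 300 &
pid=$!
sleep 1                                  # give the child a moment to exec
cat /proc/$pid/comm                      # the kernel-side name: "sleep"
tr '\0' ' ' < /proc/$pid/cmdline; echo   # the argv view: "sleep 300"
grep '^Name:' /proc/$pid/status          # same string as comm
kill $pid
```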

In libbsd there is a setproctitle() call. Its documentation says it affects the name used by ps, which for us was just the command line, not the other name. We can request that ps show this name, although this seems not to be the default.

$ ps --pid=3172 -o comm
COMMAND
Isolated Web Co

And indeed a simpler way to get it to show this is ps -a (the minus is significant!).

$ ps -a | grep [3]172
 3172 pts/2    00:00:47 Isolated Web Co

This actually checks out. The use of - (dash) in ps options signifies “Unix options” and not BSD options, whereas it is the BSD setproctitle() call that claims to change the name used by ps. The name used by a “BSD options” invocation of ps, such as ps a, is indeed just the command, not the program name. So it looks like on BSD, historically there was just one “program name” and it was stored at argv[0], but somehow a separate “program name” has also arrived in Linux.
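Side by side, with the same throwaway-process trick (the comm and args output field names are standard ps options):

```shell
sleep 300 &
pid=$!
sleep 1                    # let the child exec
ps -p "$pid" -o comm=      # Unix-options ps: the kernel-side name, "sleep"
ps -p "$pid" -o args=      # the command line, "sleep 300"
kill "$pid"
```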

Glancing at the implementation of setproctitle() in libbsd, it seems more-or-less to overwrite the contents of the buffer pointed to by argv[0]. This is a behaviour I consider broken and had previously reported as a bug in Chromium, since it may overwrite existing command-line arguments besides the name (when overwriting the argv[0] string with something longer) and/or erase the framing of them. The latter is, or was, the problem in Chromium's implementation, which concatenates all arguments, joined by a space, into one big new argv[0] string. This renders ambiguous the case of spaces within arguments. (It's also unnecessary, since the goal of the exercise is presumably to update argv[0] while leaving later arguments untouched. Although I haven't tried it, if the new argv[0] string doesn't fit in place, it can presumably just be stored elsewhere and argv[0] updated to point to it. In extremis, as we'll see shortly, it is possible for the entire argv/environ structure to be rebuilt and re-declared to the kernel... although this requires permissions that an ordinary process typically does not have.)

The source code of the BSD function's reimplementation in Chromium reveals the intricacies of this apparently simple operation on Linux. Linux has overhauled its handling of process names, introducing some bugs that Chromium needed to work around between kernel versions 4.18 and 5.2. The latest major kernel commit affecting this explains more about how things are supposed to work. It splits the logic into two cases: in get_mm_proctitle, with the comment...

    If the user used setproctitle(), we just get the string from
    user space at arg_start, and limit it to a maximum of one page.

... and in get_mm_cmdline with the following confusing comment:

    We allow setproctitle() to overwrite the argument
    strings, and overflow past the original end. But
    only when it overflows into the environment area
    .. and limit it to a maximum of one page of slop.

But all this seems to suggest that setproctitle() will change both the “command line” and the “process title”. Yet what we're seeing is that the two exist independently: one says Isolated Web Co, the other firefox-esr .... What is going on?

A bit of digging led me to discover that there is a 16-byte field comm in struct task_struct roughly here. Unlike argv[0], this exists entirely on the kernel side. The comm field turns out to be the “process title” that top is (optionally) displaying—but actually it's a thread title, since there is one per schedulable task.
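The 16-byte limit (15 characters plus a terminating NUL) is easy to observe; a sketch, assuming a writable current directory and that sleep is an external binary:

```shell
# exec a binary under a long name and watch comm get truncated to 15 chars
cp "$(command -v sleep)" ./a-name-longer-than-fifteen
./a-name-longer-than-fifteen 300 &
pid=$!
sleep 1                 # let the child exec
cat /proc/$pid/comm     # truncated: "a-name-longer-t"
kill "$pid"
rm -f ./a-name-longer-than-fifteen
```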

Use of strace reveals that top itself reads it from the /proc filesystem. Could this be the only interface on Linux that exposes the “process title” as distinct from the command line? Not quite! There is also prctl(PR_GET_NAME). Interestingly, the PR_SET_MM family of prctl() calls also lets you update the various kernel-side pointers that mark the bounds of the argv and envp vectors and their backing character data. You can also modify the position of brk, and so on. And one can even change the whole lot in one shot by passing a struct prctl_mm_map, although only if the kernel was built with CONFIG_CHECKPOINT_RESTORE. A process needs to have CAP_SYS_RESOURCE if it is to be allowed to perform any of these updates. This makes sense: being able to completely re-plumb your process's command line and environment like this is a feature rarely needed, but would be needed in a “user-space exec()” or image replacement of the kind a checkpoint-restore system might perform. Meanwhile, tweaking the argv bounds could be seen as modifying the stack size, which might need resource-override permissions. (I find the latter a bit tenuous, since it is really just setting two kernel-side pointers, and they only usefully point into memory that still has to be allocated by the calling process.)

Going back to the mysterious Linux kernel distinction between get_mm_proctitle and get_mm_cmdline, confusingly, get_mm_proctitle still has nothing to do with the thread name (the 16-character thing), which top called the “process name”, which a reasonable person might think is the same as the “process title”. It's just an alternative semantics for reading from the argv area: instead of limiting the read to the argument strings according to their original bounds, it allows overflow into the environment area. In both cases it's only reading the command line, never the 16-byte comm string. It exists only to make allowances for how a setproctitle() call might clobber the environment strings, thereby tacitly redefining a part of that region as part of the argument area.

One might also wonder why on detecting this clobbering, Linux doesn't take the opportunity to move the process's env_start and arg_end pointers, as if the necessary prctl() calls had been made as discussed above. That might be considered “too clever” I suppose, and risk creating further changes observable to certain user-space programs that might assume the status quo.

All this has consequences for a user-space runtime that wants to understand what is stored at the top of the initial stack, in the argument strings area. When a BSD-style setproctitle() happens, the effective boundary of the argument strings may change. So, remind me to patch liballocs to override setproctitle() with something that does the necessary extra bookkeeping. That probably means tracking separately the kernel's arg_end and “the place the kernel will claim is the end of argv[0]”. And of course for setproctitle() calls that don't overflow into the environment strings, we might still want to track when they've trashed argument strings.

Just to make things more confusing, the implementation of setproctitle() in libbsd does not seem to overwrite the argument strings in all cases. I can successfully set the process title to a longer string without it spilling over into later arguments. I could dig into this, but my guess is it's using the “separate storage” trick I mentioned as an option for Chromium, above... i.e. it's no longer doing the overwriting that the kernel's heuristic approach is hacking around.

Of course, after finding all this out I found a nice blog post by Tycho Andersen covering basically the same material.

[/devel] permanent link contact

Mon, 12 Dec 2022

Interoperability: what's rich is still poor

Is there good interoperability with C from C++? I could have named any two languages, and for most pairs the answer would be a clear “no”. But in the case of working with C code from C++, the situation is a lot better than normal. In this post I'll observe why it's still not good, via the lens of a particular programming task I carried out recently. I'll end with some notes on what I think good interoperability ought to mean.

My task was writing some code to target a C-style plugin interface. (It's the plugin interface of the GNU linkers, as it happens.) It follows a common C style, which is to write a bunch of callbacks and ‘register’ them.

enum ld_plugin_status new_input_handler(const struct ld_plugin_input_file *file)
{
    fprintf(stderr, "new input handler called ()");
    // ...
    return LDPS_OK;
}
enum ld_plugin_status cleanup_handler(void)
{
    fprintf(stderr, "cleanup handler called ()");
    // ...
    return LDPS_OK;
}
...
enum ld_plugin_status onload(struct ld_plugin_tv *tv)
{
...
    register_new_input_handler(new_input_handler);
    register_cleanup_handler(cleanup_handler);
...
}

If these operations need to keep any private state, we have to keep it in globals or at least file-level variables—they can be static, a.k.a. private to the file. (A slightly more sophisticated C callback idiom threads an extra void* argument to each callback, with the pointer supplied at callback registration time or at invocation time. This allows the private state to be dynamically allocated. But in this task that's not the case.)

Now let's imagine I have a family of plugins with some commonality. Maybe their handle_foo() is shared. Or maybe one handle_foo() needs to call the basic handle_foo() and then do some additional stuff before or after. The extended plugin should be defined in a separate file, but it can easily refer back to the base one. You can do this all with straightforward link-time plumbing: define a new extended_handle_foo() calling handle_foo(), say.

enum ld_plugin_status extended_new_input_handler(const struct ld_plugin_input_file *file)
{
    fprintf(stderr, "extended new input handler called ()\n");
    // ...
    // now just do the basic version
    return new_input_handler(file);
}

If we need some extended state—a new bunch of variables—then again, our ‘extended’ handler can just be accompanied with some additional file-level variables; and we use link-time reference to get at the basic ones if we need them.

What I've just enumerated are the “static-state” (link-time) analogues of some standard object-oriented patterns: method overriding (defining extended...handler()), delegation to a superclass or similar (linking back to the basic handler), holding state in fields a.k.a. member variables (the file-level variables) and extending that state in a derived object (new file-level variables).

But one quirk is that in this scenario there is no notion of class... each instantiated plugin “object” is implicitly one of a kind. It has its own code and its own data; no way is provided, or needed, to share the same code among many distinct instances of the state. Each piece of code binds directly, at link time, to the state it uses, and what we are building is a single object—albeit perhaps by composition with a base definition.

Since our C extended-plugin code is conceptually extending an object, not a class, there's still one set of state, albeit now split between the file-level variables in the 'base' and 'extension' parts. So when we want an actual plugin that we can run, we instantiate either the base or extended version, by choosing what code we link together. Typically this logic exists in makefiles, say for building plugin-base.so or plugin-extended.so by invoking the linker with different combinations of object files.

C++ classes provide a nicer notation for this “base” and “extension” idea. Maybe we should use C++ to write our plugins? If we did this, our extended plugins could explicitly declare their relationship with the code being extended; there would be no need to invent a naming convention like extended_blah.

struct extended_plugin : basic_plugin
{
...
    enum ld_plugin_status new_input_handler(const struct ld_plugin_input_file *file)
    {
        fprintf(stderr, "extended new input handler called ()\n");
        // ...
        // now just do the basic version
        return this->basic_plugin::new_input_handler(file);
    }
};

We could also explicitly declare and group the state that is basic versus the state that is extended, in the member variables of the different classes. And instead of co-opting the linkage namespace and using “static” (local symbols) for the private stuff, there are language features tailored to the specific use cases (protected or private or public).

The semantics of the whole thing are a bit different though. Rather than resolving references at link time, in C++ the binding to plugin state must be dynamic if we're to use these nicer notations. Additional pointers, of course named “this”, are threaded through each member function, pointing at the state to be used. This is to allow multiple instances of a given class. But it means a member function cannot have a signature that is compatible with the plain C function that our callback registration API expects; the extra argument gets in the way. More generally, the fallout of C++'s dynamic binding assumption is that the “static objects” style of interface cannot be modelled using objects in C++. All objectwise interactions need to involve an explicit “this” pointer.

This hinders stylistic interoperability with our static-state plugin interface. We can't simply code against this interface using these C++ features! To use C++ as intended, we would have to change the interface profoundly. For example, in the C code, although a notion of “object state” does still exist conceptually, it is not necessarily localised in memory, so there is no single pointer to it—no natural this pointer. Rather, the linker takes care of placing the various parts of it and wiring it all together; the state might be spread around in memory.

What if we really really want to use these C++ language features to write a plugin usable by the provider of this C-style plugin API? It's not easy, and most people would just fall back on writing C-style code in C++—or maybe just forget C++ entirely.

I'm not most people, so I was not so easily dissuaded. I can see at least two ways to do it: we can package the C++ code (client) in the native style of C (API), or we can package the C code in the native style of C++. Either way, I want to automate the process so that the adaptation burden largely doesn't fall on the programmer. Let's talk about the first approach here: packaging a C++ plugin to fit the C-flavoured style of the API.

To get a C-style “static” “direct binding” plugin, from a C++ class-based implementation which necessarily indirects through a “this” pointer, we need to get plain function pointers that privately access statically allocated state. We can use libffi's closure API for this. For each method, let's generate a closure that can be passed as a plain function pointer. We want to be able to write code something like this.

    MyClass obj;
    auto foo_p = generate_closure(&MyClass::handle_foo, &obj);
    plugin_api_register_foo_handler(foo_p);

Notice the use of a C++ pointer-to-member function. A pointer to member function isn't a closure because it doesn't come with an object. To call it, we have to supply the object specially, via something like this.

    (obj.*handle_foo)(arg);

In our case, once we've generated our closure, we can just do

    foo_p(arg);

because we have bound the handle_foo function together with its object obj. This works by generating a trampoline that bakes in obj as the this argument to the member function, pointing to the object that we identified when we generated the closure. In our case this is a statically allocated instance of MyClass. This does mean our plugin state now resides in a localised object, rather than a rag-bag of global variables; we have had to group our state into a single link-time object definition, but we can still use class derivation to split the declaration of that state across multiple (header) files.

If you can track down the mailing list posts (1, 2) that document libffi's closure API, it's not too hard to get closure generation working. It does, however, involve enough boilerplate that you wouldn't want to write it once per member function of even a smallish plugin. The good news is that in C++ it is possible to use template metaprogramming to generate the libffi boilerplate needed for this. My solution is now in libsrk31c++, my rag-bag library of useful bits of C++ code. (I found a page by Murray Cumming especially useful to get my head around tuples and “argument packs”.) I'll walk through it briefly.

We use a class template to group together all the various stepping-stone definitions we'll need, all parameterised by the class whose method we're turning into a closure, and the signature of that method.

/* Closure glue code specific to a particular class type and member function
 * signature is generated by this template. */
template <typename ClassType, typename RetType, typename... MemberFunArgs>
struct ffi_closure_s
{
    /* This is the type of (a pointer to) the member function that we
     * want to generate a closure for. It's any member function. */
    typedef RetType (ClassType::*MemberFunPtrType)(MemberFunArgs...);
...

The main piece of code we want to generate is a function that conforms to libffi's expectations. All generated closures do a similar thing: they slurp their arguments from the calling context's stack/registers, pack them into a block of memory, create an array of pointers to that memory, and dispatch to a function that looks like the below. That function then must unpack them and do the actual thing the closure is intended to do, e.g. our member function call, then write the return value on the end of another pointer. It's a lot of packing/unpacking.

    template <MemberFunPtrType member_fun_ptr>
    static void
    /* FN will receive the arguments
     * CIF (the original cif),
     * RVALUE (a pointer to the location into which to store the result),
     * AVALUE (an array of pointers to  locations holding the arguments) and
     * DATA (some user defined data that the callee can use to determine what
     * to do about this call.)
     *
     * It might look like this, to dispatch to a concrete C++ member function.
        int ret = reinterpret_cast<myclass *>(data)->myfunction(
            *reinterpret_cast<float*>(avalue[0]),
            *reinterpret_cast<bool*>(avalue[1]));
        memcpy(rvalue, &ret, sizeof ret);
     */
    the_fun(ffi_cif *cif, void *rvalue, void **avalue, void *data);
...

To unpack arguments from the array of pointers into our actual bona fide C++ member function call, we use some template magic. First we generate a std::tuple from the array. Doing so is deceptively simple thanks to std::tie and the miracle of the ‘...’ pattern expansion operator. Our caller will instantiate ‘Is’ with an ascending sequence of integers from zero. (Of course, there'll be a way to generate that too.)

    template <std::size_t... Is>
    static
    std::tuple<MemberFunArgs...>
    get_tuple(void **avalue, std::index_sequence<Is...> seq)
    {
        return std::tie<MemberFunArgs...>(
            *reinterpret_cast<MemberFunArgs *>(avalue[ Is ])...
        );
    }

In the above, at first it's a bit mysterious how seq gets expanded in unison with MemberFunArgs. I think this is just the semantics of ‘...’-expansion of expressions with two or more template argument packs (here Is and MemberFunArgs)—they get expanded in lock-step.

Now we need a helper that can call a function given a tuple, i.e. make it into a bona fide argument pack. Again the ascending sequence of integers helps us, this time getting the nth item from the tuple. Notice the use of std::index_sequence to generate the needed ascending sequence of integers.

    template <std::size_t... Is>
    static
    RetType call_member_fun_with_tuple(
        ClassType *obj,
        MemberFunPtrType member_fun,
        const std::tuple < MemberFunArgs ... >& tuple,
        std::index_sequence<Is...> seq
    )
    {
        return (obj->*member_fun)(std::get<Is>(tuple)...);
    }

Now we plumb these together into the actual function we promised in the forward declaration above. (Wart: the code we saw above doesn't actually work as a forward declaration! Forward-declaring templates is clearly beyond my C++ pay grade right now, so in the real code I skip the forward declaration... but I hope it was useful for exposition.)

    template <MemberFunPtrType member_fun_ptr>
    static void
    the_fun(ffi_cif *cif, void *rvalue, void **avalue, void *data)
    {
        ClassType *obj =  reinterpret_cast<ClassType *>(data);
        *reinterpret_cast<RetType*>(rvalue) = call_member_fun_with_tuple(
            obj,
            member_fun_ptr,
            get_tuple(avalue, std::index_sequence_for<MemberFunArgs...>()),
            std::index_sequence_for<MemberFunArgs...>()
        );
    }

If we're dynamically creating closures, how do they get destroyed? It seems reasonable to use std::unique_ptr. One quirk here is that in the libffi API, the closure pointer (the start of the allocated blob) is distinct from the code pointer (the actual callable instructions). It's the closure pointer we need to free, but it's the code pointer that is most useful to client code. For now, we abuse the deleter... we use the deleter object to remember the closure pointer and any other state we need. Then the unique_ptr we give to clients can point directly to the function, and we can give it a pointer-to-function type.

    struct closure_deleter {
        ffi_closure *closure;
        std::unique_ptr< vector<ffi_type *> > typevec;
        std::unique_ptr< ffi_cif > p_cif;
        closure_deleter() : closure(nullptr), typevec(nullptr), p_cif(nullptr) {}
        closure_deleter(ffi_closure *closure,
            std::unique_ptr< vector<ffi_type *> > &&typevec,
            std::unique_ptr<ffi_cif>&& p_cif)
         : closure(closure), typevec(std::move(typevec)), p_cif(std::move(p_cif)) {}
        void operator()( RetType(*fp)(MemberFunArgs...) ) const
        { if (closure) ffi_closure_free(closure); }
    };

Now we actually have the ingredients necessary to call libffi and set up a brand new closure, passing the relevant template instance that generates our the_fun.

    template <MemberFunPtrType member_fun>
    static unique_ptr< RetType(MemberFunArgs...), closure_deleter > make_closure(ClassType *obj)
    {
        RetType(*fp)(MemberFunArgs...);
        ffi_closure *closure = reinterpret_cast<ffi_closure*>(
            ffi_closure_alloc(sizeof (ffi_closure), (void**) &fp)
        );
        ffi_status status;
        auto p_atypes = make_unique<std::vector<ffi_type *> >(
            (std::vector<ffi_type *>) {ffi_type_s<MemberFunArgs>::t()...}
        );
        auto p_cif = make_unique<ffi_cif>();
        status = ffi_prep_cif(&*p_cif,
            /* ffi_abi abi */ FFI_DEFAULT_ABI,
            /*unsigned int nargs */ p_atypes->size(),
            /* ffi_type *rtype */ ffi_type_s<RetType>::t(),
            /* ffi_type **atypes */ &(*p_atypes)[0]
        );
        if (status != FFI_OK) return nullptr;
        // ffi_prep_closure is deprecated; with ffi_closure_alloc we should pass
        // the separate code address (fp) via ffi_prep_closure_loc
        status = ffi_prep_closure_loc(closure, &*p_cif, &the_fun<member_fun>, obj, (void*) fp);
        if (status != FFI_OK) return nullptr;
        return std::unique_ptr< RetType(MemberFunArgs...), closure_deleter >
            (fp, closure_deleter(closure, std::move(p_atypes), std::move(p_cif)));
    }
};

The code is a bit rough and ready. I'm not sure it's free of resource leaks, and in our rag-bag of state on the deleter, since we need the cif and type info to live as long as the closure itself lives, there are rather extravagantly two separate heap allocations. Anyway, it'll do for now and is much nicer than using libffi directly!

What have we really learned from the above? Even though on paper, C++ and C have “good” “interoperability”, the facilities one might expect for taking an interface defined in one language and using it from the other are not there “out of the box”. We just had to build one for ourselves, and the result looks quite outlandish. Even I would admit that dynamically generating trampolines using libffi is pretty hairy and not very efficient.

I mentioned a second, dual way to do it. This would work instead by fabricating a C++ view of the plain-C state, by generating a class definition that includes the relevant state and functions, and using a linker script to group the state contiguously so that its starting address can be the this pointer and its linked layout matches the struct-level layout of the class definition. That's even more hairy and I haven't attempted to make it work yet, although it's the sort of thing my dwarfidl project can help with, specifically the dwarfhpp tool that can generate C++ class or struct definitions from binary interfaces (albeit not covering this particular case yet).

I also mentioned that calling the original API “C-style” is an abuse—it's valid C++, and other styles are common in C. I'd go so far as to say that languages are not the most illuminating way to look at interoperability problems; we should think about style instead. Style is about choices; languages are about hard rules. I'm caricaturing slightly, but a PL-minded person would observe the mess we saw above, with static versus dynamic “object” notions, and say that the answer is to define a new language in which we eliminate the in-hindsight-foolhardy distinction between statically and dynamically created state. Gilad Bracha—there he is again!—has done this in Newspeak, for example. One can quibble about how best to eliminate that distinction (and a long time ago, before I had met Gilad and also before I had learned to write even somewhat graciously, I did). The language creator's view is one of identifying the bad pattern and then forbidding it! This is natural because doing such things is within a language designer's power.

My take today sidesteps most of that. Rather than eliminating distinctions, I want to embrace them. I am fine with distinctions existing as long as I can easily bridge between them. That is almost never the case today! Yet this change of viewpoint is necessary if we're to have any hope of genuinely good interoperability. Firstly, we need to stop telling everyone to use a different language! Secondly, we need to actively sanction the need for bridging, or integration, and make it easy, rather than trying to eliminate it. We need the maturity to accept that yes, different languages will come along, and they will embed distinctions as their creators see fit, wise or otherwise. In general, even when two languages are concerned with “the same” conceptual distinction, they are still likely to pick two somewhat different lines to split it down. For example, many languages have a notion of “constant” or “non-mutable” data, but no two are ever exactly alike in what you can do with them. Trying to define diversity out of existence by creating a perfect language is an unwinnable game of whack-a-mole. We need to embrace diversity.

If embracing “found” diversity is step one, what is step two? It's about asserting your right to be different! In the main part of this post I hope I convinced you that the apparent “good” “interoperability” between C and C++ isn't really good at all. You can call any C from C++, but not vice-versa... and even when you do it, you do it by writing C! Good interoperability would instead mean writing unapologetic, idiomatic C++. Then there needs to be a way for the system to map it to the other interface, here the C-style one. This is basically never possible at present, for any pair of languages. For example, many FFI approaches, such as Python CFFI, effectively do the same thing that C++ did with C: they embed C in the higher-level language. Although it may be fractionally less painful to write the resulting Python-style C than writing the C code directly, this still isn't interoperability! It's just C in disguise. “Including one language in the other” is always a poor person's interoperability. It has proven successful for C++ because the alternative is even worse. Step two means going beyond that. When a coding task involves using code written in another language, we should be able to work in our own language's native style, not just borrow the other language's surface syntax.

For a preview of what this “stylistic interop” might look like for Python, you could have a read of the VMIL workshop paper Guillaume Bertholon and I wrote a few years back, about Guillaume's very impressive internship project with me at Kent. The concept is a CPython extension module that makes native libraries accessible in a natural Pythonic style, with no FFI code required. Among many smaller problems, the main problem right now is that to maintain reference counts, we have to instrument the native code with a pointer write barrier; that is neither convenient nor fully reliable, and it slows things down more than necessary. However, I have some ideas on how to do better, and am very interested in reviving this strand of work—please get in touch if interested. Prospective PhD students might like to know that I do have a funded opportunity (caveat caveat! it's harder to fund non-UK students with these but it can sometimes be done).

[/devel] permanent link contact

Mon, 10 Oct 2022

Understanding C99 inlines

For a long time I struggled to remember the rules for using inline in C. I think I've cracked it now, though. As often with C, the trick is to think about how it's implemented.

It doesn't help that in C++, the programmer-facing semantics of inline are much more straightforward. If you want to hint that a function should be inlined, put inline (or, for a member function, define it within its class).

inline int foo() { return 42; } // this is C++, not C!

The toolchain takes care of the rest. That's because C++ implementations demand a smarter linker that can provide “link once” semantics, sometimes called COMDAT, for inline functions. Each compilation output includes its own out-of-line copy of the function, and all but one are later thrown away by the linker.

In C99 and later, inline has been specified, but in such a way that the linker need not support COMDAT. The resulting rules are a bit weird—they care about cases where we have multiple declarations of the function, and whether we use inline and extern consistently across each of them. We can still write things like this....

inline int foo() { return 42; } // also C! but semantics different from C++...

... but what they mean is different than in C++. In what follows, I'll dig into this in rather too much detail.

C inlining

Our snippet above is valid C, but its meaning varies depending on what else appears in the same translation unit. To quote from the C11 spec....

If a function is declared with an inline function specifier, then it shall also be defined in the same translation unit. If all of the file scope declarations for a function in a translation unit include the inline function specifier without extern, then the definition in that translation unit is an inline definition. An inline definition does not provide an external definition for the function, and does not forbid an external definition in another translation unit. An inline definition provides an alternative to an external definition, which a translator may use to implement any call to the function in the same translation unit. It is unspecified whether a call to the function uses the inline definition or the external definition.

The above is telling us that if I write

inline int foo();
inline int foo() { return 42; } // also C! but semantics different from C++...

... it's different from

int foo();
inline int foo() { return 42; } // also C! but semantics different from C++...

... and perhaps also different from the following.

extern inline int foo();
inline int foo() { return 42; } // also C! but semantics different from C++...

There is a logic to all this. If we write inline in a function's declaration in a given compilation unit, it's reasonable that we have to provide a definition for it too. We can't inline something if there's no definition. The question is whether that definition is also emitted for out-of-line use. In C++ the answer is “always, although it may get thrown away later”. In C the answer is “it depends on what is declared”.

We can have one or more declarations for the function in the translation unit. Let's consider two cases. First, there's the case where we always put inline on the prototype. That means none of the declarations can be mistaken for a normal externally-visible function. And hey presto, this does indeed mean that such a compilation unit “does not provide an external definition for the function”. The definition we provide is only for inlining.

However, the standard also tells us that this “does not forbid an external definition in another translation unit”. And in fact, the compiler is allowed to assume one exists! As usual, it doesn't have to inline the function in the generated code; it could generate an out-of-line call. So you'd better have another compilation unit that does things differently....

This is where it helps that the standard also tells us that “a file-scope declaration with extern creates an external definition”. In other words, by including a prototype declared extern, we instantiate the inline in that translation unit. Of course, for any particular function we should do this in only one translation unit, otherwise we'll get “multiple definition” link errors.

How do we normally use all this? Typically, we put the inline definition in a header file, and then in a single .c file we include an extern prototype to instantiate the out-of-line copy in that file. Hey presto, we have avoided COMDAT linking, by forcing the programmer to decide which compilation unit gets the out-of-line copy. That's the main idea.

It's pretty confusing though. Normally extern on a declaration means “not defined in this compilation unit”. By contrast, on an inline function it means the opposite. Even aside from that, the whole set-up is a bit wacky in that it's relying on the inconsistent use of decorators to carry information. Normally we'd perceive an inconsistency like this as a code smell and expect the semantics to be, if not an error, then as if the many declarations were somehow merged into a single canonical signature that is “what you could have written”. But here we are required to use multiple signatures just to convey the intended semantics.

One final quirk is that there is also static inline. This is much more straightforward because although it might generate an out-of-line copy, that copy would be a local symbol specific to the file in question, so does not cause link-time complexity about missing or multiple definitions. It's a good option if you just want some quick utility functions, but risks code bloat if the functions are large and/or often-used.

Adding GNU C

Of course, things get even more complicated because before C99, GNU C introduced inline with slightly different semantics. These can be seen in a third case, where we mark a definition as extern inline (above it was just a declaration where we put extern). In GNU C, this means the definition is only available for inlining, never for emission as a visible symbol. It's in fact just like prototyping it as inline in C99 and never using extern.

These old-style GNU semantics can still be requested of GCC by a command-line option (-fgnu89-inline), even when ISO dialects are requested otherwise. They can also be chosen on a per-function basis using the gnu_inline attribute. If __GNUC_STDC_INLINE__ is defined, the file-level default is the “standard” semantics, otherwise we get the GNU-style one.

It's common to see codebases use extern inline combined with gnu_inline and always_inline to get the effect of a macro: the inline function is always inlined and never emitted as an out-of-line function. (However, it turns out that even with these GNU semantics, if we have an “extern inline” definition that we elsewhere also declare as “[regular non-extern] inline”, this “only for inlining” property goes away, and an out-of-line instance is generated.)

A pattern that doesn't work

All this complicates how we deal with our header files. Ideally we'd have one set of header files, which we can freely include where we need them, and the issue of where to instantiate out-of-lines is a matter for .c files only. That ideal mostly works: we can use extern inline in a single .c file to do the instantiation.

However, some more complex scenarios turn out not to work. What if we want a function to be “opportunistically inlinable”: some compilation units will have an inlinable definition to use, but others won't and should emit an out-of-line call. This might be because the definition itself is change-prone, so is better left in a private header but not pushed to clients, who should see only the out-of-line prototype in a shared public header. Makes sense? Unfortunately this shared public header sets up an unwinnable situation: if its declaration is marked inline then this will be wrong for the external compilation units, at best generating a spurious compiler warning (inline declaration but no definition). If the public declaration is not marked inline, then all the “private” compilation units, which do include an inline definition, will instantiate the out-of-line copy, causing “multiple definition” link errors later. The only solution I know to this is conditional compilation: define a macro IN_MYPROJ in the private codebase, and prototype the function like so.

#ifdef IN_MYPROJ
inline
#endif
int myfunc(int arg);

External clients then won't know it's inlinable, but a private header can define the function inlineably and a (unique) internal compilation unit can instantiate it in the usual way.

A test suite

I wrote a single-file “test suite” that elaborates the space of inline and extern inline usage and tests what a given compiler does with all of them.

Consider a function with three declarations: the one on the definition, and two standalone ones. Each of these can be independently set to inline, extern inline, or unqualified. This creates 27 distinct cases. In each case, we test at run time whether the function really was inlined, at a test call site (using some funky introspection from within the call). In cases where we expect an out-of-line copy not to be generated, we generate one in assembly, so that any additional copy would cause a link error. (This applies only to some of the GNU cases and the ISO “inline, inline, inline” case.) In cases where this copy is not expected to be called, because inlining should happen at our call site, we give it a body that will abort with an error.

In the actual test file, which you should take a look at if you've got this far, the cases are heavily macroised, so they really just look like counting up in a weird ternary number system whose digits are inline, (empty) and extern inline.

CASEn(1, inline, inline, inline, assert(!OOL_EXISTS); assert(!AM_INLINED); /* UND ref created! */ )
CASEo(1, inline, inline, inline, assert(!OOL_EXISTS); assert(AM_INLINED);)
CASEn(2, , inline, inline, assert(OOL_EXISTS); assert(!AM_INLINED); )
CASEo(2, , inline, inline, assert(OOL_EXISTS); assert(AM_INLINED);)
CASEn(3, extern inline, inline, inline, assert(OOL_EXISTS); assert(!AM_INLINED); )
CASEo(3, extern inline, inline, inline, assert(OOL_EXISTS); assert(AM_INLINED);)
...

Optimization affects the expected semantics, so there are actually 54 cases: each of these 27 tests is instantiated once with optimization turned on at the call site (the “o” case) and once without (the “n” case). We make the simplifying assumption that, since our test inline functions are small, optimization implies that it will be inlined. (I did not attempt to test the GNU always_inline attribute, but its behaviour is basically to swap the expected behaviour from “unoptimized” to “optimized”.) In eleven of the 27 pairs, the expected behaviour is different with old GNU-style semantics (e.g. from -fgnu89-inline) than with ISO semantics, so we use conditional compilation to assert the right things.

If 54 cases sounds like a lot, then arguably they are a bit redundant. Of the 27 pairs, one is degenerate (no inlining at all), and of the others, arguably those that duplicate a set of qualifiers between two declarations are redundant. For simplicity I chose to expand them all out anyway.

The test code itself is pretty unportable, so do read the caveats at the top of the file, and take care that you build it with the necessary link options to allow the funky introspection to work (principally -Wl,--export-dynamic).

I should add that my motivation for fully getting my head around all this came when I found that inline semantics in CIL were subtly broken in some cases. In short, CIL's output was not reproducing function declarations faithfully with respect to the input, so it could unwittingly change the inlining behaviour of code as it passed through. I've fixed this in my personal branch, to which I have also added my test case. It's on my list to contribute a lot of my CIL improvements back to mainline. I have a lot of this to do....

[/devel] permanent link contact

Thu, 06 Oct 2022

How to do link-time symbol wrapping... as a plugin

I wrote recently about the problems with link-time symbol wrapping using the --wrap option to GNU linkers and similar. In short, references don't get wrapped in some cases when they arguably should, if definition and use are in the same file. But there is another way, using symbol replacement, that can do such wrapping if we're careful about self-references and the like.

I've now created a plugin for the GNU BFD or gold linkers that implements, in effect, a command-line option for wrapping that is mostly interchangeable with --wrap except that it handles those cases as described. I've tested it with the test cases I documented last time, and it gives the right behaviour. (But that's almost as far as testing has gone!)

It's called xwrap-ldplugin (for “extended wrap”) and is part of my elftin repository. It should be moderately easy to build. Apart from standard libraries it depends only on librunt (build-time dependency only), on the normrelocs tool in the same repository (build that first!), and on the binutils plugin-api.h file that defines the linker plugin interface. Once built, you can load the plugin with -plugin and give it a sequence of -plugin-opt argument strings that are the symbol names to be wrapped, much as if each had been prefixed with --wrap and included on a standard command line.

Caveat emptor! The semantics in respect of shared libraries may be a bit different to --wrap. I haven't tested these cases very well, but I did find one important case where my new approach requires modification. What if the wrapped symbol is defined only in an “input” .so file, so is not going to be defined locally in the output? The muldefs approach works by first adding a __real_ alias of the wrapped symbol, then replacing that symbol with the wrapper. But we can't add this alias to a shared library on the system, only to a relocatable .o file. The solution is just to fall back on --wrap for these symbols, since they are always ones that the standard wrapping behaviour handles properly.

I'm not a fan of plugin architectures generally. They tend to work only for anticipated use cases, and their APIs tend to suit the creators much more than their users. They tend to embed a lot of incidental detail while often forcing clients to reinvent wheels. In this case, much about the linker invocation is not exposed through the plugin API. Since it's a C API not a string API, information from the command line needs to be “forwarded”, or “bound” into it—always a bad pattern. One of the things my plugin cares about is the list of --wrap-treated symbols; we want to inspect this and maybe add to it (as covered above). Only the most bleeding-edge versions of the API even let me get the list of symbols that the command line is asking to --wrap, and there's no way to add to the list. There's also no way to enable the magic -z muldefs option that the new approach relies on. Even bread-and-butter stuff is missing: you can add an object to the link, using ld_plugin_add_input_file, but you don't get to say anything about its position in the command-line link order. This really matters when there are archives involved. I'm not criticising the authors; adding all these features is really work. It's the concept that is the problem.

Sequencing of actions is also a problem with callback-based plugins. Although the current plugin API can tell me what symbols are defined where, at the all_symbols_read event, this happens relatively late on in the link; it might well be too late to add new --wrap options (if that were even possible). In general, it's just the nature of data dependency that one application of the same code might need to know some things earlier than another one does; a fixed sequence of callbacks is asking for trouble. I need to be able to get the contents of the input objects at a time that works for my application, meaning before the set of --wrap options is decided, whereas for the vanilla linker, that ordering constraint just doesn't apply.

(If adding new wrapped symbols after resolution sounds a bit circular—wrapping will change what named symbol is defined where!—that's because it is. Linkers contain some quite subtle phase-ordering problems. I wrote about this in section 5.3 of the REMS project's ELF linking paper.)

Let's think about an alternative. I can think of two. The first is a library API not a plugin API. What's the difference? In short, a library is reactive not proactive. It gives a blank canvas to the client programmer and provides stuff for them to use, in whatever order they like. By contrast, the plugin gives the client a temporal structure it might not want: specific callback points spliced into an otherwise hard-shell tool. If the linker were itself architected mostly as a library—including routines for, say, parsing command lines—plus merely a one-page main function that glues together the library's primitives in the vanilla way, we would have two useful degrees of freedom. We could write a new one-pager if the vanilla sequence doesn't work for us. Or if it does, we could make do with overriding specific operations that it calls. If these sound suspiciously like common workflows in object-oriented programming, there's a reason for that. Callbacks that add logic let us override in an “around” sense only. This is a subset of the possibilities in a classic O-O inheritance or delegation set-up.

The concept of the GNU BFD is not too far from this library-plus-client-glue idea, but the glue part is more than just glue: all the stuff to do with linker scripts and command lines and so on is still inside the linker client, not a library. One could imagine a library-style refactoring of the linker which does realise this sort of extensibility.

Instead of libraries, another alternative would be to say that the command-line interface is the only interface. Any extension then works as a rewriter of linker invocations. This is a much more black-box approach; it is what an adapter or wrapper script does, and is an annoyingly good fit for the problem at hand here. It satisfies a nice Occam-like property: interfaces should not be multiplied beyond necessity! We're spared from working with a brand new interface consisting of weirdly specific, weirdly limited callback handlers that don't give us the access we need. On the other hand, strings are not as rich as C data structures, and we forgo the potential benefits of sharing data structures with the linker or of fine-grainedly borrowing its working state (such as might be passed to callbacks!). These can all allow us to achieve our ends more efficiently—whether with less repeated work at run time or less repeated code at development time.

Since I was committed to writing a plugin, even if just for fun, I used some hacks to get the “two out of three” solution: using the callbacks as best I could, but also allowing rewriting of the command line. The first hack was to write my own parser of the linker command line (for want of a library version!) and to snarf the original command line from the auxiliary vector. My command-line code is pretty fiddly, relatively unportable, incorrect in some cases, and will need to be maintained (somewhat) as new options are added. If the existing parser could be isolated behind a library interface, this could be avoided. For now, at least I can get at any information in the command line.

The second big hack, and my biggest hammer against these plugin API shortcomings, is a self-restarting mechanism. This is really nasty, but also pleasing if you like nasty things. Essentially it's doing command-line rewriting from within the linker, to avoid the need for a separate wrapper script. If I detect a condition on which I need to change the command line, such as needing to add a --wrap option, I simply restart the link using execve(), using a command line that is modified accordingly. Care is needed to avoid an infinite loop of restarts.

As an approach to extending an arbitrary program, this is a terrible idea: it might leave behind all kinds of weird filesystem state and partial or repeated outputs. Or maybe we've consumed some inputs and so they're just not there any more on restart. But since a linker is basically a pure function over stored inputs, we can restart at any time reasonably harmlessly. Even if it has already created some or all of the output file (unlikely in this case), it will just be overwritten. Restarting does, however, create a hazard around leaking temporary files: if any part of the linker happened to create a temporary file it expected to clean up later, this may instead be orphaned by the execve(). It is possible to create temporaries defensively so that they will be unlinked by the time any execve() happens; I take that approach in my code, but of course, other plugins' authors, or the linker's own authors, won't know that they need to. So the restarting trick is definitely a workaround rather than the good engineering solution, which would be to write and contribute the necessary extensions to the plugin API.

So, do feel free to check out the plugin if you'd like to do symbol wrapping, while being aware that it's definitely not production-ready code. Aside from the risk of leaking temporary files (for debugging, it currently even omits to clean up its own temporaries) it completely fails to report various errors. It's still useful though. Contributions are welcome and I will keep making small improvements as time allows.

[/devel] permanent link contact

Wed, 03 Aug 2022

How and why to do link-time symbol wrapping (or not?)

It's sometimes useful to hook functions at link time. On ELF platforms (among others) a GNU-style linker offers the --wrap option which seems to be exactly what we want.

       --wrap=symbol
           Use a wrapper function for symbol.  Any undefined reference to symbol will be
           resolved to "__wrap_symbol".  Any undefined reference to "__real_symbol" will be
           resolved to symbol.

Unfortunately there are several problems that arise when using this. In this post I'll cover some ways to solve these problems. Some of the ways I've found, others I've created. At the end, I'll discuss whether we should be doing any of this.

Problem 1: wrapping file-internal references

As the helpful manual page states, --wrap is not a complete solution.

Only undefined references are replaced by the linker.  So, translation unit
internal references to symbol are not resolved to "__wrap_symbol".  In the next
example, the call to "f" in "g" is not resolved to "__wrap_f".

       int
       f (void)
       {
         return 123;
       }

       int
       g (void)
       {
         return f();
       }

Why not? The text implies that it's because there's no undefined reference to f. This means a relocation record referring to a symbol that is undefined (UND) in the referring file. So, if we move either f or g into a separate file, we can ask objdump -rd to show us the two files.

$ objdump -rd f.o g.o
f.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   b8 7b 00 00 00          mov    $0x7b,%eax
   9:   5d                      pop    %rbp
   a:   c3                      retq 

g.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <g>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   e8 00 00 00 00          callq  9 <g+0x9>
                        5: R_X86_64_PLT32       f-0x4
   9:   5d                      pop    %rbp
   a:   c3                      retq 

As expected, we see a reference to f from g, marked by a R_X86_64_PLT32 relocation record. In this case, --wrap will work perfectly, by binding instead to __wrap_f (assuming we supply one). This works because f is an undefined symbol in g.o.

$ objdump -t g.o | grep 'f$'
0000000000000000         *UND*  0000000000000000 f

By contrast, if we keep f and g in the same file, we get something that looks similar but is actually quite different.

$ objdump -rd testwrap.o
testwrap.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   b8 7b 00 00 00          mov    $0x7b,%eax
   9:   5d                      pop    %rbp
   a:   c3                      retq 

000000000000000b <g>:
   b:   55                      push   %rbp
   c:   48 89 e5                mov    %rsp,%rbp
   f:   e8 00 00 00 00          callq  14 <g+0x9>
                        10: R_X86_64_PLT32      f-0x4
  14:   5d                      pop    %rbp
  15:   c3                      retq 

To see the difference we have to look at the symbol for f that the relocation record is referencing.

$ objdump -t testwrap.o | grep 'f$'
0000000000000000 g     F .text  000000000000000b f

This time it's a defined symbol. That means g's reference to f is already bound to the definition of f. There is no separate identity for “the thing, called f within the code of g, that g calls”. There is just “the definition called f”. Worse, that's directly how things came out of the assembler. Since no link-time action binds this reference, link time is too late to interfere.

This is an awkward non-compositionality when assembling and linking ELF or most other object code: if we group multiple parts into a single file, such as by refactoring what was once many source files into one, at link time we get a less descriptive model of the referential structure of the whole. Binding has occurred eagerly, and we have no way to talk about “references requesting a thing called f” as distinct from “references to a definite thing, namely f”. Since --wrap relies on the distinction, it can't have any effect.

From now on I'll refer to this pre-binding issue as “problem 1a”.

In fact there is another way for --wrap to go wrong, which is if we want to wrap a static (file-local) function. Although wrapping a static function is less common, it might still be desirable. Reliably wrapping local calls to statics is intractable without going back to the source, because they will often have been inlined. But when a static function is address-taken, it might end up being called from far and wide within the program; we might reasonably want to instrument this so that actually, a wrapper is the thing that gets address-taken. Problem 1a guarantees that this cannot possibly work: references to static functions are always intra-file, and therefore to a defined symbol. However, even if problem 1a were solved, we'd have another problem: at the point where we reference static function f, on many architectures, the compiler can use a relative jump or call. Within the same text section, such jumps don't even need a relocation record, so the linker has no clue that reference is even happening. This is problem 1b. Later on I'll discuss ways to solve it, but let's worry about the basic non-static case initially.

Naive solution 1a: symbol unbinding

To solve problem 1a, intuitively we want some way to “unbind” these assembler-bound references. There is no off-the-shelf way to do this (trust me! but I discuss this further below). Way back in 2008 I wrote a patch (unfortunately buggy graduate-student code) to GNU objcopy that extends it to perform unbinding, adding an option --unbind-sym SYM.

$ ./objcopy --help
...
     --unbind-sym <sym>       Separate <sym> into __def_<sym> and __ref_<sym>
...

For wrapping, I would use this followed by a second objcopy to change __def_f back to f and __ref_f to __wrap_f, so that references will go to the wrapper, just as we wanted.

I submitted my patch on the Bugzilla in July 2008. (Intriguingly, to this day it still shows as “NEW”. Clearly the Bugzilla was the wrong place to post it.) Some years later I even fixed the code, integrated it into a forked binutils tree which I then kept vaguely up-to-date for a few years. The story doesn't end there, but it's a viable start. So let's return to problem 1b.

Naive solution 1b: -ffunction-sections and objcopy --globalize-symbol

I mentioned that wrapping a static function is sometimes desired, but that a call to a static function might not even have a relocation record. There's no hope of diverting to a wrapper function at link time if the linker doesn't have any relocation to apply. Consider this code.

#include <stdlib.h>

static void *myalloc(size_t size)
{
        return malloc(size);
}
void *(*get_allocator(void))(size_t)
{
        return myalloc;
}

If we use objdump -rd on the resulting object code, we see this.

Disassembly of section .text:

0000000000000000 <myalloc>:
  (snip code of myalloc)

000000000000001a <get_allocator>:
  1a:   55                      push   %rbp
  1b:   48 89 e5                mov    %rsp,%rbp
  1e:   48 8d 05 db ff ff ff    lea    -0x25(%rip),%rax        # 0 <myalloc>
  25:   5d                      pop    %rbp
  26:   c3                      retq 

Notice the lea whose argument points back to the definition of myalloc. It is just referring to a point earlier in the same text section; no relocation is needed.

The fix is to compile with -ffunction-sections. Since each function gets its own section, a call to a static function now crosses a section boundary, so it requires a relocation (sections are the unit of placement in memory). Even references to static functions will therefore come with a relocation record.

(In contexts where you don't control compilation, -ffunction-sections might not be an option, but you could in principle write a tool that walks over the binary, decodes its instructions and re-adds relocs at instructions that address in-section memory locations.)

If we do this, we get the following.

Disassembly of section .text.myalloc:

0000000000000000 <myalloc>:
  (snip code for myalloc)

Disassembly of section .text.get_allocator:

0000000000000000 <get_allocator>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # b 
                        7: R_X86_64_PC32        .text.myalloc-0x4
   b:   5d                      pop    %rbp
   c:   c3                      retq 

In other words we now have two different sections and a relocated reference between them. If we compile like this, then promote the local symbol to a global using objcopy --globalize-symbol, we should be able to wrap it like any other... job done? Not quite!

Real solution 1b: preferring non-section relocs

Notice that the relocation record above isn't relative to a symbol called myalloc; it's relative to the section symbol .text.myalloc. So the reference won't be unbound if we ask to unbind myalloc! We also can't just name .text.myalloc as the symbol to unbind since (among other problems) the linker doesn't perform wrapping on section symbols anyway.

As a quick fix, I provided a way to eliminate use of section symbols in relocs where a named function or object symbol was available and could be used instead.

$ ./objcopy --help
...
     --prefer-non-section-relocs   Avoid section symbols in relocations, where possible
...

This is progress... although it seems nasty. The nastiness does not end there. Let's keep track of what we've learned about how to handle these tricky questions.

Uses of a definition should be wrapped, so force them to use an undefined symbol, unbinding reference from definition if both are present in the same file.
Uses of a definition should be wrapped, and only ordinary symbol-based relocs can be wrapped, so force these references to use ordinary-symbol-based relocs, not section-symbol-based relocs.

Real solution 1a: refinements to unbinding

Even without worrying about static functions, some similar issues emerged in the course of using my naive unbinding tool. Just as we worried above about “section references”, there are also “self-reference” and “meta-reference” cases.

Meta-references are (my name here for) references from debugging information, or similar metadata, to “base-level” code or data. This doesn't just affect debugging per se; unwinding information also includes these meta-references, and is critical to C++-style exception handling. Naturally, metadata about compiled code wants to refer to specific places in that code. And naturally, to do so it uses relocation records bound to defined symbols. For example, the .eh_frame unwind information section for a hello.o, when dumped by readelf, looks like this:

00000018 000000000000001c 0000001c FDE cie=00000000 pc=0000000000000000..0000000000000017
  DW_CFA_advance_loc: 1 to 0000000000000001
  DW_CFA_def_cfa_offset: 16
  DW_CFA_offset: r6 (rbp) at cfa-16
  DW_CFA_advance_loc: 3 to 0000000000000004
  DW_CFA_def_cfa_register: r6 (rbp)
  DW_CFA_advance_loc: 18 to 0000000000000016
  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop

... but actually the binary also includes relocation records for the section, to refer to the actual address of its text section.

Raw data contents of section .eh_frame:
  0000 14000000 00000000 017a5200 01781001  .........zR..x..
  0010 1b0c0708 90010000 1c000000 1c000000  ................
  0020 00000000 17000000 00410e10 8602430d  .........A....C.
                        20: R_X86_64_PC32       .text
  0030 06520c07 08000000                    .R......      

This is what ensures that if you dump a fully linked hello binary, the line showing pc=0000000000000000 above will instead actually mention the real address of the main function. (If you're wondering: stock objdump can't dump data sections with relocation records interspersed like I've shown above. But my objdump-all script can.)

The whole idea of “unbinding” is to wrench apart bindings to defined symbols. But clearly we don't want to unbind debugging information from the code it describes! We just want to unbind calls and address-takes. Unfortunately, at the object code level, it's not trivial to distinguish these from meta-references. In my patch, the best I could do was the following.

+  /* What is a self reloc? It's either
+   * - in a debug section;
+   * - in .eh_frame (HACK);
+   * - within the dynamic extent of the symbol to which it refers. */
+  bfd_boolean is_debug_or_eh_sect =
+    (bfd_get_section_flags (ibfd, isection) & SEC_DEBUGGING)
+         || 0 == strcmp(bfd_get_section_name (ibfd, isection), ".eh_frame");
...

We then avoid unbinding references coming from a debug or unwind section. For better or worse, BFD doesn't count .eh_frame as a debugging section (certainly one wouldn't want strip -g to strip it) so I had to recognise .eh_frame by name.

Another way one might think we could have got the same effect is simply by exploiting the behaviour that gave us problems with statics earlier: we don't unbind relocations going via section symbols. We saw above that the relocation used the .text symbol. Maybe debugging relocs always use section symbols? If so, we'd be all clear. Sadly, this is mostly true but it is not always true. If you grep the objdump of a large build tree, you'll find some debugging sections using ordinary symbol relocs. We could imagine handling these with a converse of --prefer-non-section-relocs, which for debugging sections prefers the section-symbol-based relocs. I haven't coded this up in binutils, but it would be worth having; let's add it to our rules.

Meta-references should not be wrapped. They are always local within a file, so force them to use section-symbol-based relocs, which won't be wrapped.

One reason I haven't implemented it yet is that we'll soon see some cases it can't help with. (However, actually I have coded it up in a standalone tool; more below.)

As well as meta-references, there are self-references. Many cases of self-reference are actually, like meta-reference, referencing the internals of a function. For example, on architectures with only absolute branches, a goto would generate a relocation record. Again, we don't want to unbind it so that it gets redirected to the same offset within a wrapper function; that would be meaningless! Modern architectures use relative not absolute branches in the direct case. But once we start doing indirect branches we usually do start needing relocations. Here is some GNU C code that generates a self-reference on many architectures (albeit still not x86-64) by taking an internal address.

#include <stdio.h>
int main(void)
{
        printf("A function-internal address: %p\n", &&resume);
resume:
        return 0;
}

On 32-bit x86 we get the following.

Disassembly of section .text:

00000000 <main>:
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 04                sub    $0x4,%esp
  11:   83 ec 08                sub    $0x8,%esp
  14:   68 26 00 00 00          push   $0x26
                        15: R_386_32    .text
  ...

At the end of the listing we see the rightmost argument to printf, which is pushed first, being formed by a self-referencing reloc against the .text section.

We clearly don't want to unbind this sort of thing. It's tempting to think that we should just let all self-references go undisturbed. But what about this?

int call_me(void *arg)
{
    if (...) return 0; else return call_me(...);
}

Or this?

int is_me(void *arg)
{
    return (arg == is_me);
}

What should wrapping mean in the case of self-reference via an ordinary symbol? Both of the above are compiled as referencing an actual public definition, via its function symbol, rather than referencing internals via a section symbol. The compiled code for is_me looks like this.

0000000000000000 <is_me>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
   8:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # is_me <is_me+0xf>
                        b: R_X86_64_PC32        is_me-0x4
   f:   48 39 45 f8             cmp    %rax,-0x8(%rbp)
  13:   0f 94 c0                sete   %al
  16:   0f b6 c0                movzbl %al,%eax
  19:   5d                      pop    %rbp
  1a:   c3                      retq 

Intuition suggests that probably the recursive call should be wrapped, but the address-take should not. But why?

To explore just how tricky the semantics of identity can get, let's divert ourselves to ask: can the is_me function above ever return false when passed itself? On dynamically linked ELF platforms (or in fact any SunOS-style dynamic linking system), the answer is yes, because is_me can be overridden by another object (e.g. the executable or a preloaded library). In such a context, the identity comparison is then done against this other definition, i.e. the function isn't really checking for itself at all. That's a bug: the code really doesn't want to refer to “the active definition of a thing called ‘is_me’”. It wants to really refer to itself, really! To make it reliable under dynamic linking, one would change the code above to instead test against a hidden alias, which cannot be overridden.

extern int really_me(void *arg) __attribute__((alias("is_me"),visibility("hidden")));

What about the analogous problem with wrapping? If we really want to reference ourselves, there should be some wrap-proof way to do so. Luckily, in my current objcopy unbind logic, I don't unbind self-references at all.

+              /* Okay, the reloc refers to the __ref_ symbol. Is the relocation site
+               * in the same section as the __def_ symbol? */
+              bfd_boolean is_self_section = nu->def_sym && nu->def_sym->section == isection;
+              bfd_boolean is_self = is_debug_or_eh_sect || is_self_section;
+              if (!is_self) continue;

Here the code treats specially any intra-section references. If we compile with -ffunction-sections, only true self-references will be intra-section. In more general contexts, the code above will leave alone any calls that are bound to a target within the same section, whether to the same function or not. (Note that in relocatable code, even ordinary symbols are defined relative to a section. The section of the defined symbol is what matters in the test above; the symbol itself could be a section symbol or an ordinary symbol.)

Since we currently exclude self-calls from unbinding, by analogy with our dynamic linking case the effect is equivalent to “all self-reference is as if to a hidden alias”. In short, self-reference is non-interposable by --wrap.

This, however, is the wrong thing. In the earlier example of a recursive call, we said that intuitively, we might well want the self-reference to be unbound and (hence) wrapped, at least in some cases. The function is simply using itself; it can happily do so indirectly via the wrapper. I went with the simpler no-unbind behaviour only because I'm lazy and I didn't face such cases in my application, which is wrapping allocation functions. These are non-recursive by nature. However, my application does have code similar to is_me: a function that wants to be able to recognise its own address. Since unbinding that self-reference broke my code—the function ended up erroneously looking for its wrapper's address—I just stopped unbinding self-reference, by adding the is_self_section predicate above.

If that's wrong, what is right? We need a more nuanced understanding of reference. It's the nature of wrapping that a reference to plain f has become ambiguous; we have to resolve the ambiguity one way or another, either to the wrapper or to the real thing. The old linguist's distinction of “use” versus “mention” is a good fit here. Meta-references, which we've already dealt with, are pretty clearly mentions, not uses. Self-knowledge, like with is_me, is also a mention. But there is also self-use; a recursive call would count. Correct wrapping requires telling these apart, so that self-use can go via a wrapper but self-mention can be direct. This is important.

Self-references may be self-mentions or self-uses.

How can we distinguish self-mention from self-use at the object code level? Unlike with interposition in dynamic linking, an ELF-level protected or hidden-visibility self-alias would not help, because --wrap does not pay any attention to symbol visibilities. It happily wraps a reference to a hidden definition, and/or from a hidden UND. We could layer a rule on top, particular to our unbinding tool, say skipping self-unbinding in the case of hidden symbols. However, this is an abuse of hidden and would stop us ever wrapping a hidden symbol, which we might still want to do.

Instead of a protected or hidden alias, there's another alias which is tailor-made for this ambiguity-resolving purpose: the “__real_” alias! It's an alias that is magically defined only when wrapping, and it's exactly what we want. To do self-reference that is not wrapped, the wrapped function f needs to refer to __real_f instead of just f. It does mean that the function needs to be written in the knowledge that it's going to be wrapped (or might be). If it wasn't written with this knowledge, a custom fixup pass could be done on its assembly or object code.

The program must disambiguate self-mention from self-use, by using __real_ for self-mention, and leaving self-use the default.

Finally, self-reference has yet another case to worry about, which I'll call “self-reference at a distance”, and which arises when inlining is involved. It poses a real problem for correct wrapping, even without unbinding. Imagine an is_me-style function that gets inlined into its caller. This will generate an outgoing reference to is_me, a symbol defined elsewhere, maybe in the same file or maybe further afield. Now suppose we decide to wrap is_me. If it is defined in the same file, we will unbind this outgoing reference (it's not a self-reference) and wrap it. If it isn't—say if is_me is an inline and its out-of-line copy is instantiated in a different file—we don't need unbinding; plain old wrapping will divert the reference into the wrapper. Both of these are wrong! It's still a mention, not a use. The function really wants to get hold of its own identity, not its wrapper's identity. The only change is that the reference now appears to be non-self; it is coming from somewhere distant, thanks to the inlining. As before, the solution is that it must refer to __real_f. In short: in the presence of inlining, any self-mentioning function must at source level use __real_ if it is possibly going to be wrapped. The compiler will know how to emit that reference whatever the inlining context.

Here we can further critique our earlier not-quite-right idea of leaving section-symbol-relative relocs undisturbed but always wrapping ordinary-symbol-relative ones. I mentioned that debugging information sometimes makes reference via ordinary symbols. The cases we just considered must use an ordinary symbol reference, whether with or without the use of __real_, because they may refer to a definition in another file. In non-inline cases, instead of using __real_ we could imagine asm hackery inside a function to get hold of a section-relative reference to its own start address. In fact, I tried it. But if inlining might happen, this won't work. The moment this asm trick gets inlined into a caller, it will at best generate a reference to the entry point of the enclosing non-inlined caller, which is not right. My attempt generated a completely spurious offset in the middle of the enclosing function's code, which was not fun to debug. (The intersection of self-referencing functions and inline functions is small, so this issue will arise rarely. But I managed to find it.)

What's to say this use-versus-mention distinction is only an issue for self-reference? There might be mentions that are non-self, i.e. one function mentioning another, but not using it, meaning that semantically it should not be wrapped. An example might be a function that inspects the on-stack return address in order to behave differently depending on which function is calling it. If the caller happens to be wrapped, it still wouldn't be correct to test for the wrapper's address range, since the wrapper is never going to be the immediate caller. Since introspecting on one's caller is a kind of metaprogramming, we could bucket this as another kind of meta-reference; either way, it's clearly a mention not a use. This practice is rather perverse and I haven't run into such cases. But the same solution would apply: mentions of a possibly-wrapped function need to reference __real_.

Another issue with meta-levels is that there can be more than two. I have in the past had cause to write functions of the form __wrap___real_f, i.e. linking with both --wrap f and --wrap __real_f. Perhaps surprisingly, this works to an extent—the first wrapper, which appears to call __real_f, does indeed call into the second, __wrap___real_f. Annoyingly though, referencing __real___real_f doesn't work. This is because there never was a genuine __real_f symbol definition; binding of a __real_-prefixed reference relies on a fiction in the linker, and that is not applied recursively to generate a fictional __real___real_f and so on. Instead I resort to finding the real definition through an alias I created specially earlier.

This is suspiciously starting to resemble an alternative way of doing wrapping without --wrap, as I'll cover next.

Solutions 1a′ and 1b′: doing it in standalone tools

At some point it became clear that my objcopy patches weren't going to be mergeable into GNU binutils. The treatment of debugging sections alone is a giant hack—especially the unwind sections, which I could recognise only by name. I did persist with maintaining my patch for a while (or even letting someone else do it) but maintaining a fork of binutils is no fun. Its long build time was a burden for people wanting to try out my project.

I spent ages trying to find a way to get the desired unbinding effect using only stock objcopy and ld, but concluded that it's not possible. While objcopy lets you add and remove and rename symbols, none of these is right. Adding a new symbol won't make relocs use it, and in any case it won't let us add a new undefined symbol, only a defined one. Renaming a symbol simply leaves all the relocs pointing at this now-renamed symbol. And removing a symbol isn't what we want—in fact objcopy will refuse to do it, on the grounds that there are relocation records depending on it. We want to somehow “make undefined” the existing symbol, so that previously bound relocs will become unbound. Even if we could do this by first adding a new undefined symbol (which we can't), we'd need a way to “switch over” relocation records from one symbol to another. Sadly, objcopy gives us no way to manipulate relocation records as distinct from the symbols they reference.

I therefore wrote a really minimal tool, sym2und, which looks for a symbol of a given name in a given .o file, and changes it to be undefined. With that, we can use the stock tools to complete the job by first adding __def_SYM as an alias of SYM (using ld -r --defsym), then changing SYM to be undefined (using sym2und), then renaming the undefined symbol to __ref_SYM (using objcopy --redefine-sym).

What about the static function case (problem 1b)? For that, as before, we need -ffunction-sections and objcopy --globalize-symbol. And we want a standalone tool in place of my patch's --prefer-non-section-relocs option, to ensure an address-taken function is referenced using the symbol name we expect, rather than a section symbol that we don't. This is again quite simple to write as a standalone tool over ELF files; my version is called normrelocs. Just like my patched objcopy, it needs to skip references in debug sections, to avoid hooking meta-references that happen to point to the entry point of the static function.

What about meta-references, self-references and self-reference at a distance?

Meta-references via ordinary (non-section) symbols will be disrupted if the symbol is wrapped, because they will keep referencing the old symbol, which becomes undefined and is later resolved elsewhere. We can use the same debugging-specific hack to extend normrelocs with a converse case which flips from-debug references to use the section symbol.

Self-references that are “mentions” need to be directed at __real_, which will work fine (not wrapped). Self-uses will be diverted to __ref_, which means they will be wrapped, as we intend.

Self-mentions-at-a-distance simply need to use the __real_ symbol. That is never defined locally in the file, so these will not be wrapped, which is correct.

Solutions 1a′′ and 1b′′: an equivalent?

The approaches we just saw worked by very local surgery to single .o files, in order that a later link-time --wrap option would take fuller effect.

There is a way to reduce (but not eliminate) the need for these custom tools, if we don't mind abandoning --wrap and using another linker feature to do the wrapping. It's a bit fragile, because it is sensitive to the order of .o files on the command line, which normally doesn't matter and which build systems and compiler drivers can make surprisingly difficult to control. It's a big hammer, because it suppresses what would otherwise be link errors. And it doesn't escape the need for an earlier fixup pass, both to handle the statics issue (by normrelocs or similar) and also to create an alias serving the purpose of __real_SYM. It does, however, avoid any need for unbinding, since it works even if SYM is already defined and used in the same .o file. So in some ways it's cleaner and shows that the --wrap option didn't really need to exist.

The feature is the -z muldefs command-line option, which tells the linker to tolerate multiply-defined symbols, keeping the first definition of each and ignoring the rest. This does not remove any content from the link—content meaning sections, not symbols. It just affects how that content is labelled, i.e. which symbols end up decorating particular positions or ranges within it, and hence how the content becomes interconnected when relocations are applied.

Instead of --wrap SYM after unbinding, we do the following.

  1. add a fresh name for SYM; we choose __real_SYM by analogy with --wrap
  2. create the wrapper definition called simply SYM (not __wrap_SYM), using __real_SYM if it wants, in a file wrapper.o
  3. include wrapper.o before the original definition on the command line, meaning -z muldefs will give it precedence

We use link order and muldefs to take control of SYM, but the code from its old definition still gets included in the link output; it is just “de-symbol'd” since the earlier definition takes priority. But since earlier we created an alternative symbol name for it, we can still refer to it. Unlike --wrap, this approach happily diverts references that were previously bound to that definition—by providing a new definition of the same name, replacing the old one and acquiring its relocations in the process.

Using this approach is quite a disruptive change, if you're used to using --wrap. You now have to pull out the wrappers into one or more separate objects, so they can be specially placed before the “real” definitions in link order. Previously, your __wrap_ functions might be defined in any of the input .o files. To arrange this “placed before” command-line property in a makefile, my best attempt would be for the link target to set something like CC := $(CC) -Wl,wrapper.o (and hope that no other part of the build logic clobbers CC in a conflicting way). This isn't ideal.

I don't know a way to avoid that problem, but we can at least make things more compatible with --wrap, by changing our wrapper.o to be like the wrappers before: define __wrap_SYM and put it anywhere on the command line. Then we augment it with a linker script wrapper.lds, which does have to go first on the command line, but says just the following.

SYM = __wrap_SYM;

Placing it first brings us two benefits: we can still call our wrapper __wrap_SYM, rather than SYM, and we can have the wrapper logic originate in any object on our command line. We only require the new wrapper.lds to appear before the other, now-wrapped definition of SYM, whichever object that happens to be in. (This is especially handy if the wrapper is in an archive; usually, moving an archive to the front of the command line is not an option.)

What language is the above snippet in? It's the linker script language. There are two main reasons to write a linker script: to provide additional link inputs or definitions (often also doable via the command line, but unwieldy), and to control the memory layout of the final binary (not doable via the command line). These are more-or-less disjoint. The latter is handled by the “default linker script”. The former can be done by naming additional text files in the linker command, each containing declarations in the linker script language as we saw above. (The boundary between these two use cases is a bit fuzzy; I've found that a dropped-in linker script can also contain SECTIONS rules, which are normally the preserve of the default script. From what I've seen, the semantics seems to be that these are concatenated with those of the default script's SECTIONS rules, although I could be wrong.)

You might ask whether we could avoid our pre-aliasing step that created __real_SYM, by instead making the linker script say something like this.

__real_SYM = SYM;   /* let __real_SYM refer to whatever SYM refers to */
SYM = __wrap_SYM;        /* now let SYM refer to __wrap_SYM? */

This seems like a nice pithy definition of what wrapping should do. Unfortunately it doesn't work. Despite the semicolons and equals signs, this isn't an imperative language. Symbol definitions are treated equationally: the defining expressions are all evaluated at the same time. So what the above really does is set SYM and __real_SYM both to equal __wrap_SYM. This breaks the wrapper, making it recurse infinitely.

(In our use case, therefore, we might like to have a “predefine symbol” feature, whose defining expressions are evaluated in an “old” context that is then updated, giving a two-stage semantics. Of course that is still limiting; a fuller solution, with arbitrary dependencies between definitions, would require the linker's algorithm to change from a single “build symbol table” pass into something more complicated and with more failure cases—effectively embedding a “proper” programming language, whether imperative or declarative.)

How does our new approach fare with meta-references, self-reference and self-reference at a distance?

As before, it will screw with debug info in the occasional cases where that's relocated via an ordinary (non-section) symbol. So, we really need our version of normrelocs that does the converse rewriting in the case of debug info sections, making them always use section-symbol-based relocations.

With self-reference, again it will wrap unless you reference the __real_ symbol, for both apparent and at-a-distance kinds of self-reference. As we decided earlier, this is the right thing.

Recap of the methods and test cases

We accumulated the following principles.

Uses of a definition should be wrapped, so force them to use an undefined symbol, unbinding reference from definition if both are present in the same file.
Uses of a definition should be wrapped, and only ordinary symbol-based relocs can be wrapped, so force these references to use ordinary-symbol-based relocs, not section-relative relocs.
Meta-references are mentions and should not be wrapped. They are always local within a file, so force them to use section-symbol-based relocs, which won't be wrapped.
Self-references may be self-mentions or self-uses. The program must disambiguate these, by using __real_ for self-mention, and leaving self-use the default.

Meanwhile, we saw various approaches to realising wrapping, which we developed in several stages.

Can we systematically test and compare these against each other? I'll happily rule out testing those stages with asterisks, since they need some kind of manual intervention (and I never wrote a patch to objcopy that will prefer section-based relocs for meta-references). In our tests we simply assume that -ffunction-sections is already in use if desired and any symbol that's to be wrapped is already a global.

That still leaves us many cases! In total there are eight.

        a reference is made to code...                  ...e.g.?
1       inter-section   from code        by named def   normal call or address-take or self-mention/use at a distance
2       inter-section   from code        by section     (nothing realistic I can think of)
3       intra-section   from code        by named def   self-call
4       intra-section   from code        by section     self-mention; address-taken label
5       inter-section   from data        by named def   function pointer initializer
6       inter-section   from data        by section     (almost nothing realistic I can think of)
7       inter-section   from debuginfo   by named def   some debuginfo references
8       inter-section   from debuginfo   by section     most debuginfo references

We can write a test that is a single assembly file. Semantically, the executable it builds is nonsense and would crash immediately if run. But we just care about what it looks like after linking. The assembly code includes eight references. For each one, we want to see whether it gets wrapped or not.

.globl callee
.globl caller
.section .text.caller
caller:
L1:
# (1) inter-section, from code to code, by named def
    .quad callee
L2:
# (2) inter-section, from code to code, by section
    .quad .text.99callee
.section .text.99callee, "ax"
callee:
L3:
# (3) intra-section, from code to code, by named def
    .quad callee
L4:
# (4) intra-section, from code to code, by section
    .quad .text.99callee
.section .data, "aw"
L5:
# (5) inter-section, from data to code, by named def
    .quad callee
L6:
# (6) inter-section, from data to code, by section
    .quad .text.99callee
.section .debug_info.text.99callee, "",@progbits
L7:
# (7) inter-section, from debuginfo to code, by named def
    .quad callee
L8:
# (8) inter-section, from debuginfo to code, by section
    .quad .text.99callee

The correct outcome, according to our principles, is that all the odd-numbered cases (named definitions) except 7 (debug) get wrapped, as do 2 and 6 (inter-section but non-debug), but not 8 (inter-section but debug) or 4 (intra-section by section symbol, so assumed to be a self-mention). So, the wrapped cases should be 1, 2, 3, 5 and 6.

We can test it for each of our six approaches. To do a test run, we will assemble it and link it, generating a nonsense executable which we interrogate using gdb.

as -L -o test.o test.s
echo 'SECTIONS { .text : { *(*) } }' > allsects.lds # script to link into one big blob
ld  -Ttext=0x0 -e 0x0 -Bstatic -o test test.o allsects.lds

If our file is linked correctly, we should find our test cases labelled in ascending address order, contiguously from 0x8.

$ objdump -t test | grep '^0.*L' | sort -n -k1,1
0000000000000008 l       .text  0000000000000000 L1
0000000000000010 l       .text  0000000000000000 L2
0000000000000018 l       .text  0000000000000000 L3
0000000000000020 l       .text  0000000000000000 L4
0000000000000028 l       .text  0000000000000000 L5
0000000000000030 l       .text  0000000000000000 L6
0000000000000038 l       .text  0000000000000000 L7
0000000000000040 l       .text  0000000000000000 L8

Now we can script gdb to print out the values of all eight references after linking.

$ cat >/tmp/gdbscript <<'EOF'
file test
set $n=1
while $n<=8
 eval "x/1ga L%d", $n
 set $n=$n+1
end
quit
EOF
$ gdb -q test -x /tmp/gdbscript | tail -n+2 | column -t

On a null run, i.e. of method Z, we get the following.

0x8   <caller>:     0x18  <callee>
0x10  <caller+8>:   0x18  <callee>
0x18  <callee>:     0x18  <callee>
0x20  <callee+8>:   0x18  <callee>
0x28  <callee+16>:  0x18  <callee>
0x30  <callee+24>:  0x18  <callee>
0x38  <callee+32>:  0x18  <callee>
0x40  <callee+40>:  0x18  <callee>

In other words, all references go to the callee and nothing is wrapped. If we test each of our other approaches in turn, we get the results we would expect. Approach A also does no wrapping, because without unbinding, intra-file references aren't wrapped. Approach B starts to get it but covers only the easy cases. Approach C gets all except number 3 correct, since it ignores all self-reference, whereas case 3 (named symbol) is a self-use that should be unbound. Approaches D and E both wrap exactly the cases our principles agree should be wrapped: 1, 2, 3, 5 and 6. E does it using the muldefs trick, avoiding --wrap entirely.

testD  0x0   <caller>:         0x20  <__wrap_callee>
testD  0x8   <L2>:             0x20  <__wrap_callee>
testD  0x10  <__def_callee>:   0x20  <__wrap_callee>
testD  0x18  <L4>:             0x10  <__def_callee>
testD  0x28  <L5>:             0x20  <__wrap_callee>
testD  0x30  <L6>:             0x20  <__wrap_callee>
testD  0x38  <L7>:             0x10  <__def_callee>
testD  0x40  <L8>:             0x10  <__def_callee>
testE  0x8   <caller>:         0x0   <__wrap_callee>
testE  0x10  <L2>:             0x0   <__wrap_callee>
testE  0x18  <__real_callee>:  0x0   <__wrap_callee>
testE  0x20  <L4>:             0x18  <__real_callee>
testE  0x28  <L5>:             0x0   <__wrap_callee>
testE  0x30  <L6>:             0x0   <__wrap_callee>
testE  0x38  <L7>:             0x18  <__real_callee>
testE  0x40  <L8>:             0x18  <__real_callee>

The test is packaged as a single makefile that you can get here, for which you will want my elftin (that contains normrelocs and sym2und) and patched binutils repositories.

Problem 2 (next time, mostly): wrapping cross-DSO references

Comparing muldefs with --wrap, another difference is that with --wrap, the __real_ symbol was never actually defined; it was just a magic alias that could be used to refer to the original unqualified symbol. The __wrap_ symbol did exist, so, in the final executable we have symbols __wrap_SYM and SYM. In contrast, using the muldefs symbol replacement approach, the final executable gives us SYM (our wrapper) and __real_SYM (which we created as a bona fide symbol definition, aliasing the original function). In our “more compatible” approach above we also get __wrap_SYM, but only because we made it an alias of SYM, which isn't essential.

Of these two outcomes, the latter symbol replacement case plays much more nicely with dynamic linking, since it gives the wrapper the identity of the original function SYM. In other words, the wrapped SYM, if exported, will naturally be what an external DSO binds to when it makes reference to SYM. That's always what we want. (Meta-references never cross DSOs, and self-mentions need to use __real_ as before.) By contrast, with ordinary wrapping, SYM would be left referring to the unwrapped symbol, and that will be what gets exported—so cross-DSO calls won't get wrapped.

As hinted at above, it's also useful to do wrapping only at dynamic link time, without using --wrap at all but relying on LD_PRELOAD or, more generally, link order. Of course this only suffices when wrapper and wrappee are in different DSOs. Even then, it is hard to arrange and even harder to do reliably... a topic for another time.

Is this sensible?

We've taken it as a given that messing with symbols at link time is a reasonable thing to want to do. Yet also, we saw pretty early on with static functions that without going back to source, we were limited in what we could do. Even worse, we couldn't be sure that what we were doing was even meaningful in a source-level sense. So maybe all this is just a terrible idea?

The issue comes down more generally to whether link-time artifacts—relocatable object code, essentially—have any meaning to the programmer. Are they a notation that can legitimately be manipulated?

Traditionally, one could feel reasonably comfortable doing this because compilers were less clever than they are today. Long ago, most programmers worked close enough to the metal that object code had a place in their mental model. To systems programmers today, this largely remains the case, but to application programmers, it doesn't. Meanwhile, to compiler developers, it is very inconvenient to have this extra intermediate level of abstraction hanging around, at which programmers might want to intervene and mess with the source-level semantics, which to a compiler author are all that matters.

However, since we as “end programmers” often want to plug together code coming from different languages or different toolchains, it should be clear that the compiler author's take isn't really all that matters. Plugging stuff together is a case of integration, and by its nature, integration often involves stepping beyond the view of any single compiler or language implementation.

I've written before, with colleagues, about how linking is used in semantically significant ways, not just as a mechanism for separate compilation. However, POSIX did not standardise linker interfaces and they continue to be unportable, obscure and unreliable. (Witness my colleague Laurie Tratt's recent travails embedding a binary blob into a built binary, which is still only portably doable via a compiler or, slightly less portably, an assembler.) Meanwhile, link-time optimizers (LTO) routinely break link-time interventions of the kind this post has been describing, including symbol wrapping. These optimizers assume that they control the entire path from source down to the final binary. The approach of the LTO systems I've seen is to issue fake intermediate object files containing opaque compiler-specific blobs (serialized IR). At link time, the compiler is re-run in disguise (as a linker plug-in) to reprocess these blobs, discarding the fake contents (which could be used by a non-LTO-aware linker, but normally get ignored). This mirrors the compiler author's perspective: only source semantics matter, and “the usual” intermediates can validly be faked out as a compatibility hack. Of course, if you do funky stuff like symbol wrapping on these intermediates, it will not affect the compiler blobs and so will be discarded.

I believe that object code has value as an integration layer. It has deliberately moved away from the details of source-level semantics, but exposes a “wiring and plumbing” abstraction. This is a convenient level for humans to express compositions—including wrapping, but definitely not limited to that. (My PhD thesis did rather more elaborate stuff at link time, to adapt between binary interfaces. It was influenced by a nice earlier seam of work on linking and configuration languages such as Alastair Reid's Knit.)

Even better, object code actually doesn't lack semantics at all. There is invariably an established, standard and elaborate system for describing its structure and meaning, relating it not just downwards to the machine (a fairly easy step) but upwards to source language features. This system is called debugging information, which is a misnomer since it has many applications besides debugging. All this could certainly be cleaner and tidier—for example, our latent criteria for distinguishing “meta-references” above could and should be a lot more explicit. Also, renaming a symbol in an object file usually won't rename a definition in the corresponding debug info; again, it should. And there is a lot more semantic information that could usefully be included. Nevertheless, the practice of bundling some such information is well established. (So is the practice of unbundling it and throwing it away, unfortunately.) Although many language implementers no doubt see this as an inconvenience that gets in the way of fancy optimisations, a world with good tools for working with object code is a world closer to realising the goal of software as workable ‘stuff’. This is precisely because, counterintuitively, that world is necessarily one where no single tool is so over-powerful as to exclude others. (Readers of Ivan Illich will recognise this idea, of course.)

[/devel] permanent link contact

Mon, 18 Oct 2021

ELF dynamic linking: a brief introduction

[I wrote this to help current or future students who might want to do projects involving dynamic linking. But it may be of interest more widely. There may well be an equivalent or better article out there; if anyone knows one, please let me know.]

In old-fashioned static linking, the executable file captures a complete starting “memory image” of the program being loaded. The kernel simply maps the binary into memory, creates an initial thread stack, dumps some useful information onto it (the auxiliary vector) followed by the program name and argument strings, then transfers control to the entry point identified in the ELF binary's header. So far, so straightforward. I'll assume readers familiar with this. (If not, among other resources, you could read section 3 of the OOPSLA 2016 paper I co-authored.)

In dynamic linking, the executable no longer provides a complete image; rather, it is one of a collection of “dynamic shared objects”, or DSOs, that provide the starting memory image. (Normally, the executable itself is counted as a DSO, even though it may not be “shared”.)

The dynamic linker composes this image at load time, by loading both the executable and the (transitive closure of) depended-on library DSOs.

Dynamic linking is delegated by the kernel to user space: the kernel still only knows how to load static binaries, and the dynamic linker is a special static binary that knows how to load (and link) dynamic binaries. The kernel is extended just enough to split this case off: it must figure out, after loading an executable, that actually the dynamic linker is needed to complete the loading. The dynamic linker can be any binary nominated by the executable file, but usually the standard one (for me: /lib64/ld-linux-x86-64.so.2) is the only one on the system. In practice, the kernel will map both the executable and the dynamic linker into memory, so the dynamic linker only normally needs to load the libraries, not the executable. (Sometimes the dynamic linker can also be run as a program, in which case it has to load the executable too.)

DSOs have dependencies on other DSOs, in the form of NEEDED records. You can see these in the dynamic linking information dumped by readelf -d. For example, here is what I get from a simple hello C program, which needs only one library.

Dynamic section at offset 0x2df8 contains 26 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x1000
 0x000000000000000d (FINI)               0x11b4
 0x0000000000000019 (INIT_ARRAY)         0x3de8
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x3df0
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x308
 0x0000000000000005 (STRTAB)             0x3d8
 0x0000000000000006 (SYMTAB)             0x330
 0x000000000000000a (STRSZ)              130 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x4000
 0x0000000000000002 (PLTRELSZ)           24 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x548
 0x0000000000000007 (RELA)               0x488
 0x0000000000000008 (RELASZ)             192 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffb (FLAGS_1)            Flags: PIE
 0x000000006ffffffe (VERNEED)            0x468
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x45a
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

These records give just the names of the depended-on DSOs (actually their soname—more below). Where in the filesystem these binaries are to be found depends on the linker search path, which is controlled by your LD_LIBRARY_PATH environment variable, by a configuration file, and by any RUNPATH records in the loaded file (also visible using the command above, although our hello does not contain any) and, transitively, those in any files loaded to satisfy a NEEDED dependency. The ldd command asks the dynamic linker to do a trial run and show where it located the dependencies (or failed to!).

In memory, each DSO gets a range of the virtual address space, starting at some load address. So in the DSO's ELF binary on disk, all “addresses” are actually offsets relative to the load address. It used to be that executables always had a load address of 0, so the executable file's addresses were bona fide virtual addresses. However, nowadays executables are typically position-independent and get loaded at a non-zero address too.

(DSOs are mostly contiguous in memory. However, nothing prevents them from containing holes in between their mapped segments. In edge cases it is possible for holey DSOs to be loaded such that they overlap. This is rare, but I did have to go to considerable lengths to support it, or rather prevent it, in one of my runtime projects.)

After loading, there may be some load-time relocation required. This is performed by the dynamic linker and is very much like the relocation that happens during static linking, but is used to resolve inter-DSO references. (In some degenerate cases, it also resolves intra-DSO references that have been deferred until load time.) Many of the entries in the output shown above are telling the dynamic linker where to find the relocation and symbol information for this.

A “shared library” is a library DSO whose text segment does not need load-time relocation. In other words, it works fine as-is when loaded at any address. This flexibility to map the file simultaneously at many addresses, without the need to fix it up in memory after loading, is what allows memory to be shared between many processes under ELF dynamic linking. (Some other linking systems, including Windows DLLs, instead force the library to be placed at the same virtual address in all address spaces, hence the infamous process of “registering” DLLs.) Elimination of any need for load-time relocation is achieved by indirecting inter-DSO references that would otherwise be (say) direct call instructions, so that they occur instead via a couple of (non-shared) tables in the data segment: the Global Offset Table (GOT) and Procedure Linkage Table (PLT). Roughly, the latter is used for calls, the former for data access. Others have written about those (see Eli Bendersky's very nice in-depth article), so let me just say that the GOT is exceptionally poorly named: it is not global (there is one per DSO), and does not contain offsets (it contains pointers, at least after relocation). They should both arguably be called “vectors” rather than tables, because they don't have multiple columns.

Despite the “S” in “DSO”, there is no requirement that library DSOs satisfy this shareability property. Instead, ELF linkers usually complain if you try to create a DSO that does not (often via an inscrutable message ending with “recompile with -fPIC”). The dynamic linker doesn't care, however. If it has to fix up your text, it can do so, relying on copy-on-write to allocate memory as needed. Shared libraries are “shared” only because that copying doesn't happen.

Since libraries and executables are now deployed as separate files, they are subject to version skew. Libraries often try to offer backward compatibility, such that a newer library can still support an older executable (or, equally, support some old library declaring it as NEEDED). A vaguely standard symbol versioning mechanism is defined for ELF, to allow a library to offer older and newer versions of “the same” symbol and have references get bound to the right one, preserving binary compatibility. Alternatively, a binary-incompatible version of a shared library may be issued, in which case it's good practice to reflect that in the library name. There is a convention for recording version bumps in a number that forms part of the “soname”, or shared object name, hence “libc.so.6” in the above listing. This suggests six version-bumps have occurred in the GNU C library. In this way, many distinct incompatible versions of a library may be installed on the same system. Symbol versioning takes a bit more discipline and more build-time ceremony to set up, but it tends to be preferred over sonames, especially for libraries that are widely deployed. The GNU C library has not bumped its soname for many years.

DSOs have identity at run time. The dynamic linker maintains a doubly linked list called the “link map”. It is possible to load new DSOs, such as plug-ins, during execution, via the dlopen() library function. This adds them to the link map, alongside all the libraries loaded earlier. Indeed the dynamic linker itself shows up in the link map as just another library, providing (among other things) the dlopen call (or at least its back end; on GNU systems there is a separate libdl as a front-end). It also provides a symbol that gives access to the link map itself (_r_debug). Over this, a breakpoint-based protocol allows debuggers to walk the link map, so they can find out about the libraries that are loaded. DSOs can also designate initialization and finalization routines to be run when they are loaded and unloaded.

That's enough for a brief introduction. To get more details about what's going on at run time, you could read my older post about ELF introspection.

A few years back, Simon Ser did some great work on an executable semantics for dynamic linking, as a summer internship of the REMS project. Since then I have been struggling to get time to resurrect and publish this work, and it no longer falls directly within my prioritised interests. But if anyone is interested in taking it up, do contact me... it is begging to be polished off and written up.

[/devel] permanent link contact

Thu, 14 Oct 2021

Tracing system calls in-process, using a chain loader

Some months back I wrote about a possible better alternative to LD_PRELOAD using chain loading. And I promised a follow-up with more details... so here it is. We'll walk through a little system call tracer built using this technique. Unlike strace and similar, it will do its tracing in-process, without using ptrace() or equivalents. I should add that it's a proof-of-concept only so isn't nearly as good as strace in practice; in particular it can't yet accurately print the arguments to most syscalls. (I have some work towards doing that generatively, using kernel DWARF; more later.)

System call tracing is a good demo of chain loading, for three reasons.

Firstly, it's basically impossible to do with LD_PRELOAD, which works at the higher level of the C library and dynamic linker, so doesn't have sight of system calls.

Secondly, it's close to what people very often want to do with LD_PRELOAD, which is to change the behaviour of particular system calls. No longer do we have to make do with overriding the C library wrappers; we can intercept the system calls themselves, catching certain cases that would not be visible at the C library level.

Thirdly, intercepting system calls specifically provides the foundation for a more general bootstrapping approach to arbitrary binary instrumentation of programs, as I hinted last time. If we can catch all the system calls which introduce new instructions to the address space, and prevent instructions from being introduced otherwise, we can ensure that literally all code is instrumented.

Some years ago, Patrick Lambein and I wrote a library that can patch and emulate most system calls on x86_64 Linux, by a really simple double-trap technique: overwrite the system call with ud2, x86's “always-illegal” instruction, then handle the resulting SIGILL. The handler can do the real syscall or whatever else it wants, before returning. This is not the fastest, but it's reliable and easy(-ish) to implement—at least to 90% of the way. So, for example, if we disassemble the C library's _exit function which once contained the following instructions

... 
44 89 c0                mov    %r8d,%eax
0f 05                   syscall  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>
...

under our library this gets clobbered to instead be

... 
44 89 c0                mov    %r8d,%eax
0f 0b                   ud2  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>
...

which will always cause a SIGILL. Our SIGILL handler scrapes up the syscall arguments into a structure and then passes this to whatever registered handler is installed. For example, a tracing handler, which prints some stuff and then resumes the system call, looks like this.

void print_pre_syscall(void *stream, struct generic_syscall *gsp, void *calling_addr, struct link_map *calling_object, void *ret) {
        char namebuf[5];
        snprintf(namebuf, sizeof namebuf, "%d", gsp->syscall_number);
        fprintf(stream, "== %d == > %p (%s+0x%x) %s(%p, %p, %p, %p, %p, %p)\n",
                getpid(), 
                calling_addr,
                calling_object ? calling_object->l_name : "(unknown)",
                calling_object ? (char*) calling_addr - (char*) calling_object->l_addr : (ptrdiff_t) 0,
                syscall_names[gsp->syscall_number]?:namebuf,
                gsp->args[0],
                gsp->args[1],
                gsp->args[2],
                gsp->args[3],
                gsp->args[4],
                gsp->args[5]
        );
        fflush(stream);
}
// what actually gets called from the SIGILL handler
void systrap_pre_handling(struct generic_syscall *gsp)
{
        void *calling_addr = generic_syscall_get_ip(gsp);
        struct link_map *calling_object = get_highest_loaded_object_below(calling_addr);
        print_pre_syscall(traces_out, gsp, calling_addr, calling_object, NULL);
}

To ensure that all loaded code gets instrumented appropriately with the ud2s, I've recently expanded this to use a bootstrapping technique: after initially applying our trapping technique to hook mmap(), mremap() and mprotect() calls, our hooks ensure that if the guest application is asking to create any further executable memory, we will instrument its contents before we make it executable.

But how can we get all this code into the process in the first place? What I did initially was use our old familiar LD_PRELOAD. We preload a library whose initializer stomps over the loaded text, clobbering syscalls into ud2 and installing the SIGILL handler. This is fine as far as it goes, but it misses the very beginning of execution. A tracer doing this cannot trace anything happening before the initializer runs, for example in other libraries' initializers or inside the dynamic linker. What we get is the following.

$ LD_PRELOAD=`pwd`/trace-syscalls.so /bin/true
== 26439 == > 0x7f6fb2a4f9d4 (/lib/x86_64-linux-gnu/libc.so.6+0xc99d4) exit_group(0, 0x3c, 0, 0x7fff766064b0, 0xe7, 0xffffffffffffff80)

It's neat that it looks like true only does one syscall, to exit_group(), as we might think it should. But if we try strace we find that in truth the process does a lot more than that one system call.

$ strace /bin/true
brk(0)                                  = 0x2447000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6ba773c000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64", 0x7ffd49ca38e0) = -1 ENOENT (No such file or directory)
... snipped a *lot*! ...
arch_prctl(ARCH_SET_FS, 0x7f6ba7578740) = 0
mprotect(0x7f6ba7732000, 12288, PROT_READ) = 0
mprotect(0x605000, 4096, PROT_READ)     = 0
mprotect(0x7f6ba7765000, 4096, PROT_READ) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

All these other, earlier system calls are happening during dynamic linking, or in general, before our constructor that clobbers the syscalls into ud2. (The strace manual page laments: “it is a pity that so much tracing clutter is produced by systems employing shared libraries”. My take is that it's better to know the truth!)

Let's repackage our code as a chain loader, as described last time. That way it will always run first, and because it's responsible for loading the dynamic linker, we can ensure that all its syscalls are trapped from the very start. This is a non-trivial delta to the donald-based code I mentioned in the last post. After we map the “real” dynamic linker, we also need to instrument its code. We do this by wrapping the enter() function from donald.

void __wrap_enter(void *entry_point)
{
        __libsystrap_force_init();
        init_fds();
        /* ... */
        trap_all_mappings(); // only ld.so is loaded right now
        install_sigill_handler();
        /* ... */
}

What about code that the dynamic linker loads? Well, we use our bootstrapping approach to catch that. Whenever new code is loaded into the process, some system call is responsible for doing that. So as long as we catch that call, we can trap the system calls at load time. A bit of macro magic helps generate the wrappers, and some elided hackery is required to divine the ELF-level structure of the file being mapped.

#define mmap_args(v, sep) \
    v(void *, addr, 0) sep   v(size_t, length, 1) sep \
    v(int, prot, 2) sep      v(int, flags, 3) sep \
    v(int, fd, 4) sep        v(unsigned long, offset, 5)
bootstrap_proto(mmap)
{
    if (!(prot & PROT_EXEC)) return mmap(addr, length, prot, flags, fd, offset);
    // else... always map writable, non-executable initially
    int temporary_prot = (prot | PROT_WRITE | PROT_READ) & ~PROT_EXEC;
    void *mapping = (void*) mmap(addr, length, temporary_prot, flags, fd, offset);
    if (mapping != MAP_FAILED)
    {
        /* ... */
        if (/* needs trapping*/)
        {
            inferred_vaddr = (uintptr_t) mapping - matched->p_vaddr;
            trap_one_executable_region_given_shdrs(mapping, (void*) mapping + length,
                    filename, /* is_writable */ 1, /* is_readable */ 1,
                    shdrs, ehdr.e_shnum,
                    inferred_vaddr);
            // OK, done the trap so let's fix up the protection (always different)
            ret = mprotect(mapping, length, prot);
            if (ret != 0) goto fail;
            /* ... */
        }
        /* ... */
    }
    /* ... */
}

Making this work on Linux/x86 (32- and 64-bit) required some “interesting” code in a few cases, over and above the chain loader trick. The current version still has some unsupported features (sigaltstack() among them) but can support a lot of stuff thanks to various gyrations.

The good news is that things mostly work. In particular, I can do:

$ ./trace-syscalls-ld.so /bin/true

and get “full” output analogous to that of strace, instead of the old output that was missing many system calls.

What about correctness of all this? This bootstrapping approach only gets us 100% coverage of system calls under some assumptions: (1) that system calls happen only by executing syscall instructions in user-space code; (2a) that we get to see, and trap, all code before it becomes executable; and (2b) that control flow only ever reaches the “overt” instructions, never byte sequences hidden inside them. Let's take these in turn.

Assumption 1 isn't true for gettimeofday() in Linux's vdso, which implements it as a read from kernel memory. No “interesting” system calls are implemented that way. Do we really just mean “system calls”, though? One could argue that fault-handling paths are in some ways like system calls; we can't trace entries and exits to/from these, though neither does strace. Signal delivery is also a bit like a return from a system call, and we don't trace that, in contrast to strace (which gets any event ptrace exposes, and that includes signals). We could trace some of it by installing our own handlers for all signals, though that would require a lot of care to preserve the program's original semantics... it starts to become a full-on “transparency” problem, of the kind DynamoRIO's authors have articulated.

For assumption 2a to be true, we must forbid mappings that are simultaneously writable and executable. Some programs, such as the just-in-time compilers of V8 or PyPy, still like to create executable memory and write into it. If we allow such mappings, we have no synchronous opportunity to modify code before it becomes executable. My best guess is to allow these programs to carry on with an emulation of writable-and-executable mappings, adaptively switching between really-writable and really-executable and handling the resulting segmentation faults in user space. The overhead would be great if done naively, but some kind of adaptive hysteresis might handle it well enough (perhaps something as simple as “flip after N faults”).

A more subtle version of the same problem is that we need to avoid distinct simultaneous W and X mappings of the same underlying object, such as a file that is MAP_SHARED. That would have the same effect as a writable-and-executable mapping. Such a check is easy in principle, but requires tracking the inode and device number behind every existing file mapping. (File descriptors are OK because we can fstat() them on demand.) Attempts to remap the same object should probably return -ETXTBSY. I've left that as an exercise for someone else, for the moment.

Assumption 2b brings the issue of overt instructions versus punny instructions. On x86 there is no alignment requirement for instructions, so we can jump into the middle of an instruction and thereby do something quite different from what was intended, perhaps including a system call. For example, the innocuous instruction

 mov    $0x9090050f,%eax

assembles to

 b8 0f 05 90 90

but if we jump one byte into this byte sequence, we do a syscall followed by two nops. Preventing this is tricky. Perhaps we trust code not to do this. Perhaps we built the program using some heavyweight control-flow integrity mechanism (and can also check that dynamically loaded code satisfies its requirements). Perhaps we are on an alignment-preserving architecture (so, not x86). For x86, once we have a sufficiently general jump-based trampoline mechanism, perhaps itself making use of instruction punning, we can forcibly stub out any instruction or instruction pair that internally contains a system call byte sequence. Doing this is sufficient only if our trampoline itself is guaranteed not to include such instructions. Such a guarantee is quite hard when immediates are involved... we would have to forbid them to contain the bytes 0f 05, for example. The same is true for addresses or offsets, but these are easier under PC-relative addressing (we get to choose where to place the trampoline). Most likely we can rely on such immediates being so rare that if we simply clobber the instruction with a ud2 and emulate it with something like libx86emulate, nobody will notice any slowdown.

Erk. That was a deeper dive than I planned. But the take-home is that early experience suggests this seems to be possible. We can intercept system calls using a chain loader, bootstrappingly, and so get control of the program's interactions across the system call interface. This lets us build all manner of tools rather like LD_PRELOAD, but directly at the system call level and with stronger coverage properties.

[/devel] permanent link contact

Mon, 04 Jan 2021

Chain loading, not preloading: the dynamic linker as a virtualization vector

Suppose you want to run a pre-existing binary program but in some kind of instrumented or modified form. On ELF-based systems, people often do this with LD_PRELOAD: you preload a library that interferes with some C library calls or somehow tweaks the process state at startup (say to install a signal handler). The program itself is (hopefully) oblivious to what you've done, but its behaviour is modified accordingly. Some years back I saw a nice survey of applications in a presentation by Kevin Pulo; there may be other resources out there.

This post is about some alternative ELF-level trickery, that is sometimes a better option for doing this. Here it's shown on GNU/Linux, but any ELF platform should allow it.

If you've tried LD_PRELOAD, you'll know that there are lots of problems and gotchas with it. Your code has to deal with being initialized at an unpredictable time; the C library's binary interface evolves from version to version, so is prone to additions that sidestep your preloads; it doesn't work for programs that do direct system calls (not via the libc) or are statically linked. What I'm about to describe avoids many of these problems, by creating what I'll call an ELF chain loader. This loader must do any program instrumentation at the instruction level, rather than the linkage level as with LD_PRELOAD. So in a sense, chain loading is less powerful than preloading: it's working at a lower level, and all your tweaks must be somehow “installed” at startup, rather than as part of the “dialogue” of library calls. However, there's a pattern we can use to overcome that limitation, so in fact I'd argue the result is more powerful than LD_PRELOAD, although (for now) harder to use.

Suppose you have a command invocation

$ ./myprog

and want your tweaked command to be invocable something like

$ LD_PRELOAD=mymods.so ./myprog

or (nicer)

$ ./mymods ./myprog

In the latter, mymods is a chain loader binary. A chain loader is basically a stub dynamic linker. It doesn't actually do dynamic linking; it's invoked like a dynamic linker, but it just loads another dynamic linker—after doing whatever tweaks and fiddles you'd like it to do.

Dan Williams has written (on a page now sadly disappeared, but WayBack-available) about how to build your own dynamic linker or ld.so. Building a binary of the right form is the easy part. Building one that actually links and runs real programs is a big job, especially if you're using a complex C library (like glibc) or complex run-time features (like thread-local storage). But in outline, a dynamic linker basically looks like this.

  1. do bootstrap relocation
  2. initialize some stuff (link map, TLS, ...)
  3. locate and map libraries, transitively, resolving symbols, applying relocations, etc.
  4. call constructors
  5. jump to the main program's entry point

The chain loader that we'll build is basically just a stub, which will call onward to the real ld.so to handle most of the above stuff. So it is much simpler:

  1. do bootstrap relocation
  2. mess with the process however we like
  3. locate and map the next loader (e.g. the real dynamic linker)
  4. (optional) disguise the fact that we ever existed
  5. jump to the next loader's entry point

If we do it right, the next loader won't even know that the stub loader has run. We can think of a chain loader as a “virtualization vector”. It sets up a (somehow-) virtualized environment, in which the entire original process, from dynamic linking onwards, then proceeds to execute.

(The “disguise” part is interesting. It makes us almost indistinguishable from malware! Virtualization is about transparently subverting a pre-existing process's execution, and there's a large class of malware that also does this.)

A few years ago I wrote a skeleton of a dynamic linker, called donald. It can load only very simple statically linked programs—hardly a real dynamic linker! Nevertheless it's a good vehicle for understanding what goes on during process start-up, and is a basis for building our chain loader. Caveat: it's very quick-and-dirty; it doesn't always use robust coding patterns, and it uses some unportable hacks to keep the code simple. That makes it possible to show almost all of it in a single blog post! Following Dan Williams's recipe, the command we use to link donald is the following.

$ cc -std=gnu99 -g -fPIC -fno-stack-protector \
  -fuse-ld=bfd -Wl,-Bsymbolic -nostdlib -nostartfiles -shared \
  -o "donald.so" premain.o main.o entry.o load.o \
  -Wl,-Bstatic /path/to/musl/lib/libc.a -Wl,-Bsymbolic \
  -T donald.lds -Wl,-soname=ld-linux.so.2 

where we're using a nice simple C library implementation, namely musl, to provide the libc calls that we use internally. The donald.lds file is a lightly tweaked linker script that adds a symbol that'll help us do bootstrap relocation, as per Dan Williams's article. The work of donald is done in the C files premain.c, main.c, entry.c and load.c.

“Bootstrap relocation” refers to the solution to an essential problem: who links the dynamic linker? In old-fashioned static linking scenarios, binaries contain a ready-to-go memory image. The program is loaded at a known, fixed address, all address bindings are known, and references have been fixed up when the binary was linked. By contrast, in dynamic linking scenarios, a library might be loaded at any address, meaning it still contains some relocation records describing fixups needed at load time. These are done by the dynamic linker. But what if you are the dynamic linker? The dynamic linker itself is just ld.so, a shared library; it, too, can be loaded anywhere. So who relocates the relocator? The answer: it relocates itself. The entry path of the dynamic linker consists of code specially crafted to run correctly regardless of where it is loaded. One of its first tasks is to locate the table of its own relocation records, relocate itself using them, and then continue execution in a “normal” environment.

On the x86-64 architecture, this is pretty straightforward because most data and code references are PC-relative. (On other architectures it's often much hairier, because it's harder to avoid using any absolute address references before you've completed bootstrap relocation.) Usually the entry path is written in per-arch assembly, precisely to accommodate the less straightforward architectures. In donald, which works only on x86-64, we have the option to write it directly “in C”. It's not really C, because we're bound by a very wacky contract: don't write any code that will cause the compiler to use non-relative addressing modes! The entry point of donald, called directly by the kernel, looks like the following.

/* The function prologue pushes rbp on entry, decrementing the stack
 * pointer by 8. Then it saves rsp into rbp. So by the time we see rbp, 
 * it holds the entry stack pointer *minus 8 bytes*. */
#define BP_TO_SP_FIXUP 0x8
void *rsp_on_entry HIDDEN;
/* This isn't the usual "main"; it's the raw entry point of the application.
 * We link with -nostartfiles. We then define our own main. */
int _start(void)
{
    /* gcc doesn't let us disable prologue/epilogue, so we have to fudge it.
     * We assume rsp is saved into rbp in the prologue. */
    register unsigned char *bp_after_main_prologue;
    __asm__ ("movq %%rbp, %0\n" : "=r"(bp_after_main_prologue));

    int argc;
    char **argv;
    rsp_on_entry = bp_after_main_prologue + BP_TO_SP_FIXUP;
    preinit(rsp_on_entry, &argc, &argv); // get us a sane environment

    printf("Hello from " DONALD_NAME "!\n");

    int ret = main(argc, argv);

    /* We're executing without startfile code, so returning 0 would not make sense. 
     * Calling exit() brings a dependency on fini_array stuff that we need to avoid
     * since it depends on the startup files. So just do the syscall directly. */
    syscall(SYS_exit, ret);

    __builtin_unreachable();
}

When the kernel jumps to the above code, the stack pointer is pointing at a bunch of useful information: the program arguments and so forth. But the C compiler will generate a prologue that moves it away from that. So the first thing we do is use a hard-coded hack to “undo” that, and get the “stack pointer on entry”, which we pass to our preinit function.

Using this pointer, preinit can navigate the initial stack to find all that data of interest. The data starts with argc; above that is argv (the vector of pointers to argument strings), above that envp (the same but for environment variables), then the auxiliary vector of records containing “stuff the kernel wanted to tell us” and finally the strings these argv and envp vectors point to. You can read more about auxiliary vectors here. Our preinit parses this chunk of memory, and extracts a couple of facts that we'll need to do our work: the page size and our own base address.
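A minimal sketch of that navigation, assuming the usual LP64 layout where each auxv record occupies two 64-bit words (the function names are mine, not donald's):

```c
#include <elf.h>
#include <stddef.h>

/* Given the stack pointer as it was on process entry, walk past argc,
 * argv and envp to find the auxiliary vector. */
Elf64_auxv_t *find_auxv(void *rsp_on_entry)
{
    long *p = rsp_on_entry;
    long argc = *p++;
    p += argc;          /* skip argv[0..argc-1] */
    ++p;                /* skip argv's NULL terminator */
    while (*p) ++p;     /* skip the envp entries */
    ++p;                /* skip envp's NULL terminator */
    return (Elf64_auxv_t *) p;
}

/* Fish one value out of the vector, e.g. AT_PAGESZ or our base address
 * (AT_BASE). Returns 0 if the entry is absent. */
unsigned long auxv_lookup(Elf64_auxv_t *auxv, unsigned long type)
{
    for (Elf64_auxv_t *a = auxv; a->a_type != AT_NULL; ++a)
    {
        if (a->a_type == type) return a->a_un.a_val;
    }
    return 0;
}
```

The real preinit does essentially this walk once, caching the couple of entries it needs.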

Bootstrap relocation uses the latter of these. We find our own dynamic linking information (symbol table and relocation table) using the _DYNAMIC symbol. But remember “this isn't C”: simply referring to _DYNAMIC may not work yet, because it's defined outside the current compilation unit (by the linker, in fact) so the compiler may emit a reference that needs relocation. By definition, the relocation hasn't been applied yet. In donald I used an unorthodox (and unportable) hack: taking the address of _DYNAMIC in our C code before we relocate ourselves gets us the unrelocated (file-relative) address, to which we manually add our own base address as snarfed from the auxiliary vector. (Better code would probably declare _DYNAMIC as file-local, using the ELF “hidden” or “internal” visibility, which would allow the linker to fix this up into rip-relative addressing.) Either way, we can get to the _DYNAMIC vector, which gets us to our symbol and relocation tables; then we walk the relocation table and apply each reloc.

static inline void __attribute__((always_inline)) bootstrap_relocate(unsigned char *at_base)
{
    /* We scan _DYNAMIC to get our own symbol table.
     * HACK: we manually relocate &_DYNAMIC
     * by our load address to get its actual address. */
    ElfW(Dyn) *p_dyn = (void*)(at_base + (uintptr_t) &_DYNAMIC);
    ElfW(Sym) *dynsym_start = NULL;
    ElfW(Rela) *rela_dyn_start = NULL;
    ElfW(Rela) *rela_plt_start = NULL;
    unsigned long rela_dyn_sz = 0;
    unsigned long rela_dyn_entsz = 0;
    unsigned long rela_dyn_nents = 0;
    unsigned long rela_plt_sz = 0;
    while (p_dyn->d_tag != DT_NULL)
    {
        if (p_dyn->d_tag == DT_SYMTAB) dynsym_start = (void*)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_RELA) rela_dyn_start = (void *)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_RELASZ) rela_dyn_sz = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_RELAENT) rela_dyn_entsz = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_JMPREL) rela_plt_start = (void *)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_PLTRELSZ) rela_plt_sz = p_dyn->d_un.d_val;
        ++p_dyn;
    }
    if (rela_dyn_entsz > 0) rela_dyn_nents = rela_dyn_sz / rela_dyn_entsz;
    /* We loop over the relocs table and relocate what needs relocating. */
    for (unsigned long i = 0; i < rela_dyn_nents; ++i)
    {
        do_one_rela(rela_dyn_start + i, at_base, dynsym_start);
    }
    /* Also do .rela.plt */
    for (unsigned long i = 0; i < (rela_plt_sz / sizeof (Elf64_Rela)); ++i)
    {
        do_one_rela(rela_plt_start + i, at_base, dynsym_start);
    }
}

I haven't shown the function that applies a single relocation record, do_one_rela, but it's pretty simple. In practice it only needs to know a few flavours of relocation.

static inline void __attribute__((always_inline)) 
do_one_rela(ElfW(Rela) *p_rela, unsigned char *at_base, ElfW(Sym) *p_dynsym)
{
#define SYMADDR(r_info) (p_dynsym[ELF64_R_SYM((r_info))].st_value)
    Elf64_Addr *reloc_addr = (Elf64_Addr *)(at_base + p_rela->r_offset);
    switch (ELF64_R_TYPE(p_rela->r_info))
    {
        case R_X86_64_RELATIVE: // no symbol addr, because we're RELATIVE
            *reloc_addr = (Elf64_Addr)(at_base + p_rela->r_addend); 
            break;
        case R_X86_64_64: 
            *reloc_addr = (Elf64_Addr)(at_base + SYMADDR(p_rela->r_info) + p_rela->r_addend);
            break;
        case R_X86_64_JUMP_SLOT:
        case R_X86_64_GLOB_DAT:
            *reloc_addr = (Elf64_Addr)(at_base + SYMADDR(p_rela->r_info));
            break;
        default: 
            /* We can't report an error in any useful way here. */
            break;
    }
#undef SYMADDR
}
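To see the arithmetic in isolation, here is a self-contained harness that applies a RELATIVE-style relocation to a fake in-memory “image”, mirroring the R_X86_64_RELATIVE case above (the harness itself is mine, using glibc's <elf.h> types):

```c
#include <elf.h>
#include <stdint.h>

/* Apply one R_X86_64_RELATIVE reloc to an in-memory image: the word at
 * (base + r_offset) becomes base + r_addend, i.e. the file-relative
 * address in the addend is rebased to wherever the image actually
 * landed. Same arithmetic as do_one_rela's RELATIVE case. */
void apply_relative(unsigned char *at_base, const Elf64_Rela *r)
{
    Elf64_Addr *reloc_addr = (Elf64_Addr *)(at_base + r->r_offset);
    *reloc_addr = (Elf64_Addr)(at_base + r->r_addend);
}
```

The same shape, plus a symbol-table lookup, handles the R_X86_64_64 and GLOB_DAT/JUMP_SLOT cases.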

We're now getting ready to take on the main task of being a chain loader, which plain donald is not (at least not intentionally). Ideally we want to be invocable in two ways: as a bona fide dynamic linker requested directly by an executable, as we see here:

$ readelf -Wl /bin/blah

Elf file type is EXEC (Executable file)
Entry point 0x401432
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /path/to/my/ld.so]
  ...

and as a command-line tool

$ /path/to/my/ld.so /bin/blah

These two cases work out slightly differently at run time. The first is the “normal” case, and the kernel has already mapped both the program itself and us the dynamic linker. In the latter, the kernel thinks it's loading us as a statically linked binary (curiously not caring that we're ET_DYN not ET_EXEC), and it's up to us to parse our arguments, identify the program we want to run, and load it. The contents of the auxiliary vector are slightly different in the two cases, and indeed that's how we'll tell the difference and behave accordingly.

int main(int argc, char **argv)
{
    // were we invoked by name, or as a .interp?
    // use AT_ENTRY to find out: it's _start if we were invoked as a program,
    // otherwise it's the program's _start
    int argv_program_ind;
    uintptr_t entry = (uintptr_t) &_start;
    _Bool we_are_the_program = 1;
    for (ElfW(auxv_t) *p = p_auxv; p->a_type; ++p)
    {
        switch (p->a_type)
        {
            case AT_ENTRY:
                if (p->a_un.a_val != (uintptr_t) &_start) we_are_the_program = 0;
                entry = p->a_un.a_val;
                break;
            default:
                break;
        }
    }
    fprintf(stderr, "We think we are%sthe program\n", we_are_the_program ? " " : " not ");

Since we're a chain loader, most of what we want to do is load the next loader and jump to it. To load the next loader, we'll need to memory-map some chunks of the file, by reading its ELF header and program headers. We can simply do this by file I/O.

    /* We always chain-load the ld.so and let it load the program. Let's read it. */
    const char ldso_path[] = "/lib64/ld-linux-x86-64.so.2"; // HACK: sysdep
    int ldso_fd = open(ldso_path, O_RDONLY);
    if (ldso_fd == -1) { die("could not open %s\n", ldso_path); }
    struct stat ldso_stat;
    int ret = fstat(ldso_fd, &ldso_stat);
    if (ret != 0) { die("could not stat %s\n", ldso_path); }

    // read the ELF header
    ssize_t nread;
    ElfW(Ehdr) ehdr; nread = read(ldso_fd, &ehdr, sizeof (ElfW(Ehdr)));
    if (nread != sizeof (ElfW(Ehdr))) die("could not read ELF header of %s\n", ldso_path);
    // check it's a file we can grok
    if (ehdr.e_ident[EI_MAG0] != 0x7f
        || ehdr.e_ident[EI_MAG1] != 'E'
        || ehdr.e_ident[EI_MAG2] != 'L'
        || ehdr.e_ident[EI_MAG3] != 'F'
        || ehdr.e_ident[EI_CLASS] != ELFCLASS64
        || ehdr.e_ident[EI_DATA] != ELFDATA2LSB
        || ehdr.e_ident[EI_VERSION] != EV_CURRENT
        || (ehdr.e_ident[EI_OSABI] != ELFOSABI_SYSV && ehdr.e_ident[EI_OSABI] != ELFOSABI_GNU)
        // || phdr->e_ident[EI_ABIVERSION] != /* what? */
        || ehdr.e_type != ET_DYN
        || ehdr.e_machine != EM_X86_64
        )
    {
        die("unsupported file: %s\n", ldso_path);
    }
    off_t newloc = lseek(ldso_fd, ehdr.e_phoff, SEEK_SET);

    // process the PT_LOADs
    ElfW(Phdr) phdrs[ehdr.e_phnum];
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
    {
        off_t off = ehdr.e_phoff + i * ehdr.e_phentsize;
        newloc = lseek(ldso_fd, off, SEEK_SET);
        if (newloc != off) die("could not seek to program header %d in %s\n", i, ldso_path);
        size_t ntoread = MIN(sizeof phdrs[0], ehdr.e_phentsize);
        nread = read(ldso_fd, &phdrs[i], ntoread);
        if (nread != ntoread) die("could not read program header %d in %s\n", i, ldso_path);
    }

This is all fairly common to any dynamic linker. Now we've collected the PT_LOAD program headers, which tell us which parts of the file to map into memory. To rule out partial failures, we reserve enough memory to be sure we can map the loader in a big contiguous chunk, even though that might be split over many PT_LOADs.

    ElfW(Addr) max_vaddr = 0;
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
    {
        ElfW(Addr) max_vaddr_this_obj = phdrs[i].p_vaddr + phdrs[i].p_memsz;
        if (max_vaddr_this_obj > max_vaddr) max_vaddr = max_vaddr_this_obj;
    }
    uintptr_t base_addr_hint = 0x555555556000;
    void *base = mmap((void*) base_addr_hint, max_vaddr, PROT_NONE, MAP_PRIVATE,
        ldso_fd, 0);
    if (base == MAP_FAILED) die("could not map %s with PROT_NONE\n", ldso_path);
    uintptr_t base_addr = (uintptr_t) base;

Once we've done that, we have reserved space at some load address which could be arbitrary (but we'll use a hint of 0x555555556000)... all we need to do is turn each PT_LOAD header into the (usually) one corresponding mmap() invocation.

    uintptr_t phdrs_addr = 0;
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
    {
        if (phdrs[i].p_type == PT_LOAD)
        {
            _Bool read = (phdrs[i].p_flags & PF_R);
            _Bool write = (phdrs[i].p_flags & PF_W);
            _Bool exec = (phdrs[i].p_flags & PF_X);

            if (phdrs[i].p_offset <= ehdr.e_phoff
                    && phdrs[i].p_offset + phdrs[i].p_filesz >= ehdr.e_phoff + (ehdr.e_phnum * ehdr.e_phentsize))
            {
                phdrs_addr = base_addr + phdrs[i].p_vaddr + (ehdr.e_phoff - phdrs[i].p_offset);
            }
            ret = load_one_phdr(base_addr, ldso_fd, phdrs[i].p_vaddr,
                phdrs[i].p_offset, phdrs[i].p_memsz, phdrs[i].p_filesz, read, write, exec);
            switch (ret)
            {
                case 2: die("file %s has bad PT_LOAD filesz/memsz (phdr index %d)\n", 
                        ldso_path, i);
                case 1: die("could not create mapping for PT_LOAD phdr index %d\n", i);
                case 0: break;
                default:
                    die("BUG: mysterious error in load_one_phdr() for PT_LOAD phdr index %d\n", i);
                    break;
            }
        }
    }
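The load_one_phdr function isn't shown here. Most of its fiddliness is that mmap() wants page-aligned addresses and offsets, whereas a PT_LOAD's p_vaddr and p_offset need only be congruent modulo the page size. A sketch of just that alignment arithmetic (assuming 4kB pages, where the real code uses AT_PAGESZ, and ignoring the memsz > filesz zero-fill tail):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL  /* assumption; the real code reads AT_PAGESZ */

struct map_params { uintptr_t addr; uintptr_t offset; size_t length; };

/* Compute the page-aligned (addr, offset, length) triple for mapping the
 * file-backed part of a PT_LOAD whose p_vaddr/p_offset share the same
 * misalignment within a page. */
struct map_params align_for_mmap(uintptr_t base, uintptr_t vaddr,
        uintptr_t file_offset, size_t filesz)
{
    uintptr_t misalign = vaddr & (PAGE_SIZE - 1);
    struct map_params p = {
        .addr   = base + vaddr - misalign,
        .offset = file_offset - misalign,
        .length = filesz + misalign
    };
    return p;
}
```

The real function then issues an mmap() with MAP_FIXED over the reserved region, plus an anonymous mapping (and a memset) for any bss-style tail where p_memsz exceeds p_filesz.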

Now we've mapped the program we want to run, i.e. the next loader. In a real dynamic linker, our work would only just be beginning: we've mapped the next program, but still have to walk its depended-on libraries and load those (transitively), apply relocations in all the above, and so on. But in our case, we're most of the way there. All we have to do is do whatever tinkering is demanded by our use case, then transfer control to the next loader, using the entry point in its ELF headers. The next loader is not just any old program, however. To cover our tracks and ensure it can find the auxiliary vector just as we did, we also return the stack pointer to where it was when we started.

void __attribute__((noreturn)) enter(void *entry_point)
{
    fprintf(stderr, DONALD_NAME ": jumping to system ld.so entry point %p with rsp %p\n",
        (void*) entry_point, rsp_on_entry);
    fflush(stderr);
    __asm__ volatile ("movq %0, %%rsp\n"
          "xorq %%rbp, %%rbp\n" /* clear rbp to avoid confusing stack walkers */
          "jmpq *%1\n" : : "m"(rsp_on_entry), "r"(entry_point));
    __builtin_unreachable();
}

In fact, we need to cover our tracks a bit better. If we're being invoked as a command, not as a requested dynamic linker, then we want to make the auxiliary vector look as if the next loader is the one that was invoked as a command too. So we do a pass over the vector to fix it up. One such fixup is required in the opposite (“requested”) case too.

    for (ElfW(auxv_t) *p = p_auxv; p->a_type; ++p)
    {
        switch (p->a_type)
        {
            case AT_ENTRY:
                if (we_are_the_program) p->a_un.a_val = entry_point;
                break;
            case AT_PHDR:
                if (we_are_the_program) p->a_un.a_val = phdrs_addr;
                break;
            case AT_PHENT:
                if (we_are_the_program) p->a_un.a_val = ehdr.e_phentsize;
                break;
            case AT_PHNUM:
                if (we_are_the_program) p->a_un.a_val = ehdr.e_phnum;
                break;
            case AT_BASE:
                if (!we_are_the_program) p->a_un.a_val = base_addr;
                break;
            case AT_EXECFN:
                if (we_are_the_program) p->a_un.a_val = (uintptr_t) &argv[0];
                break;
        }
    }

Note that in the “requested” case, since the AT_BASE vector entry is set to the loader's load address, it needs to be fixed up to that of the next loader. If we're invoked as a command, it'll be absent or 0 and can stay that way. The other cases are ones that only need fixing up if we're run as a command, as the vector needs to reflect the next loader's structure not our own (and not the program's, which it only reflects if we're run as a requested loader).

There's one more bit of track-covering we can do, and in fact need to, if we're “requested” rather than invoked. One interesting quirk of what we have so far is that the next loader, although its name is /lib64/ld-linux-x86-64.so.2, will think it's called /path/to/my/ld.so, and will create a link map entry accordingly. This will confuse debuggers, e.g. because they will look at the wrong file's symbols. It took me a while to figure out where the dynamic linker gets its own name, but it's obvious in hindsight: from the .interp string in the mapped program binary! (If we were invoked, it more obviously comes from argv.) So to avoid confusion, we want to overwrite the .interp string before we call the next loader. That takes some jiggery-pokery at link time, but we're already doing some of that to be an alternative requested linker—i.e. to link with --dynamic-linker=/path/to/my/ld.so. Linking in a small additional file assembled from

#if defined(__linux__) && defined(__ELF__)
.section .note.GNU-stack,"",%progbits
#endif
.section .interp, "aw"

is enough to make the .interp writable, and can also be used to add more padding if the stock path (which we'll be writing) might be longer than our requested path.

There's another wrinkle with this. Since the default linker script will place .interp among stuff that is otherwise read-only, this extra linked file will also have the side-effect of making writable the program headers, various notes and other stuff appearing in the segment. I'm not yet decided on the best way to deal with this. We could mprotect() this segment after we're finished with it, if we're sure we're the only reason why it should be writable. We could skip the above and simply use a pair of mprotect() calls to do the write, at least if we don't need to add padding. We could try to move .interp into the RELRO (read-only after relocation) part of the executable. Or perhaps we could arrange that our ld.so has two .interp-like chunks, one for its own name and one for the delegated ld.so, make them page-aligned, and do a remap rather than a write. This would waste some disk space in the binary, but avoid wasting memory in systems where many processes are loaded this way—a common trade-off in dynamic linking.

We could go further with track-covering, such as attempting to unmap ourselves as we hand over to the next loader. That would involve more sneaky malware-like tricks, and is overkill for most purposes.

So why is all this better than LD_PRELOAD? I can think of several reasons.

Why's it not so good?

The “but...” refers to a possible “big-hammer” way around the setuid problem: make our loader setuid root, albeit dropping root privileges super-early. That sounds crazy, but I believe it'd be an interesting demo for software verification. It only needs to be root just long enough to open the target binary and test for which UID to switch down to (either the invoking user or the file owner). The test could be done in a very short code path making probably just two system calls: one to open a file descriptor on the loaded file, and another to fstat() it. On many architectures it could be done before bootstrap relocation. Can we prove that code correct? Doing so is feasible if we have trustworthy formal specifications of instruction sets, system calls and (if needed) relocation. Not coincidentally, I've been involved in some research on these; there is more to do.
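The UID-choosing step itself could be a one-liner over the fstat() result; this sketch encodes my reading of the policy described above (run as the file owner if the binary is set-uid, else as the invoking user):

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Given the fstat() result for the binary we're about to load and the
 * real (invoking) uid, decide which uid the loader should drop to
 * before doing anything else. */
uid_t uid_to_drop_to(const struct stat *s, uid_t invoking_uid)
{
    return (s->st_mode & S_ISUID) ? s->st_uid : invoking_uid;
}
```

The verification burden is then concentrated in the tiny window between process entry and the setresuid() call that acts on this decision.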

In a future post I'll demo a very simple syscall tracer built using this combination of chain-loading and bootstrapping instrumentation. It's basically an in-process strace. System call instrumentation is usually what people want when they use LD_PRELOAD, and are forced to settle for libc instrumentation instead.

Code is available (though definitely not out-of-the-box demoable; maybe by the next post) in my donald repository, and a bit more (mostly to be explained next time) in my liballocs (see allocsld) and libsystrap.

[/devel] permanent link contact

Thu, 23 Jul 2020

Building a simple toolchain extension, the subversive way

I recently ran into a nasty problem, arguably a bug, in GCC. I decided to write a little tool to help me detect when I'm triggering this bug. This post is about how I did that, using some simple but handy helper scripts I've been growing for a while.

The bug is to do with “constructor” functions in GNU C (not to be confused with C++ constructors!). I'll quote from my bug report, slightly edited.

I was surprised to find that my __attribute__((constructor(101))) on a function was
not giving it priority over another constructor whose attribute used a higher number.
It turns out this was owing to an additional prototype that did not mention the
function as being a constructor.

This caused a really subtle initialization bug in my code, and was tough to track
down. At the least, I would expect a warning when the information is discarded.

$ cat test.c
#include <stdio.h>

// comment these for correct behaviour, i.e. 1 then 2
static void do_init1(void);
static void do_init2(void);

static void do_init2(void) __attribute__((constructor(102)));
static void do_init2(void)
{
        printf("Init 2!\n");
}

static void do_init1(void) __attribute__((constructor(101)));
static void do_init1(void)
{
        printf("Init 1!\n");
}

int main(void)
{
        return 0;
}

$ cc -save-temps    test.c   -o test
$ ./test
Init 2!
Init 1!
$ grep -B2 init_array test.s
.LFE0:
        .size   do_init2, .-do_init2
        .section        .init_array,"aw"
--
.LFE1:
        .size   do_init1, .-do_init1
        .section        .init_array

The attribute's priority argument is being discarded. It should be visible in the
section name (".init_array.0101" etc.). If the extra prototypes are removed, the
programmer-intended behaviour is restored.

I can see how this would happen. Normally, constructor functions are not called explicitly, so don't need to be prototyped. In my liballocs project, however, sometimes my code can be called during very early initialization of the program. This is mostly because it preloads a wrapper for malloc(), which is called by the dynamic linker before it even gets around to running constructors. To deal with these incoming calls at such an early time, the library can be initialized on demand, earlier than the usual initialization sequence. To allow this, several constructors are also prototyped functions that can be called explicitly across module boundaries.

Although I could do a one-time pass over my code to fix the problem, this wouldn't prevent the same problem recurring in the future. Getting a diagnostic into GCC would be a long, slow process, and wouldn't help if I switched to another compiler that takes some other questionable interpretation of multiply prototyped constructors. So instead, I thought I'd write a simple and fairly standalone tool that can detect the problem during build. In short, I want to check that if a priority is given in a prototype, then that priority is reflected in the assembly code; if not, an error is raised.

My toolsub repository, although in somewhat rough-and-ready shape, provides a collection of scripts that let us interfere with the compilation and linking toolchain in useful ways. Most of the tools it provides take mechanisms provided by existing interfaces, such as the GCC driver's -wrapper option for supplying a wrapper script, but make them easier to use. It's tricky to use -wrapper because you have to parse various command lines, and some of these are compiler-internal commands like cc1 rather than the public cc, cpp and so on.

Firstly I want to do a pass over the preprocessed C code, outputting a summary of any constructors that are declared with a priority. This priority information is what the compiler is discarding; I want to catch when this happens. To fish out these priorities, I wrote a short OCaml program using CIL. This post isn't about CIL, but you can see that tool here; it's about 80 lines of commented OCaml. (To build it, you'll need to use my branch of CIL, because it uses one API feature that I haven't yet contributed back to mainline.)

This tool calculates what I'll call a “generous” summary of the constructor priorities declared in the file—if you gave a priority on any prototype, it will always include a priority in its output. (Currently it doesn't help much with the case where multiple distinct priorities are used. There is also a bug about that.) What GCC does is the “mean” version: unless you're consistent, it will silently drop the priority; this is the problem we want to catch.

We could just report an error if we see inconsistent prototypes, which is the case that provokes the bad behaviour in GCC. However, I wanted to be a bit more “ground truth” than that: can we test what the compiler's output actually says about the constructor's priority? To do that, we need to know how the constructor priority mechanism works, and indeed about how the constructor mechanism works at all. It's roughly as follows.

For each constructor, the compiler generates assembly code to output a pointer to that function in the .init_array section. At run time, initialization code will walk this section and call each pointed-to function in order (this is done by the dynamic linker, if you're linking dynamically). To give constructors a priority, the compiler tweaks the name of this section so that it is something like .init_array.0123, for a constructor with priority 123. It is then the linker's job to sort these into order. The linker script rule that generates the final .init_array section is something like this.

  .init_array     :  { *(SORT_BY_NAME(.init_array.*)) *(.init_array) }

The real script is actually a bit messier than that, but the point is that for init-array sections named with a suffix, we link them in sorted order of their name, so lower-numbered sections will go first. Sections named without a suffix come last.

To check that the compiler didn't discard the constructor priority, we need to find out what section name is used in the assembly code. That's pretty easy if we can run our tool immediately after the compilation step proper, i.e. the step that inputs preprocessed source and outputs assembly code. Doing that is pretty easy with the wrapper feature of toolsub. The following shell script fragment does the work; dumpconstr is the OCaml tool I mentioned, which gives the “generous” summary.

# Our check works by observing a single cc1 invocation...
# already-preprocessed source goes in, and assembly comes out.
# We cross-check each. We pass the cc1 preprocessed input
# to a CIL tool that dumps the with-priority constructors.
# Then we check that the assembly output contains the expected
# incantations in each case. If it doesn't, we flag an error.
my_cc1 () {
    # drop the literal "cc1" argument
    shift
    # run the compiler
    "$@"
    # Now do our check... our caller has told us what infiles and outfile are
    for f in "${infiles[@]}"; do
        case "$f" in
            (*.i|*.ii)
                # use the CIL tool to find out what priorities are at source level
                constrs="$( "$(dirname "$0")"/dumpconstr "$f" )"
                # for each one, check the assembly agrees
                while read func constr prio; do
                    if [ -n "$prio" ]; then
                        # the symbol should be referenced in a section
                        # named ".init_array.0*%04d"
                        # where %04d is the priority, zero-padded
                        regex="^\.init_array(\.0*([0-9]+))?.*[[:blank:]]$func[[:blank:]]*\v"
                        section="$( sed 's/^[[:blank:]]*\.section[[:blank:]]*/\f/g' "$outfile" | \
                            tr '\n' '\v' | tr '\f' '\n' | \
                            sed -rn "/$regex/ p" )"
                        actual_prio="$( echo "$section" | sed -nr "/${regex}.*/ {s//\2/;p}" )"
                        if ! [ ${actual_prio:-0} -eq $prio ]; then
                            echo "Error: inconsistent use of constructor attributes on function $func" 1>&2
                            exit 1 
                        fi
                    fi
                done <<<"$constrs"
                # we process at most one input file
                break
            ;;
            (*) # skip files that don't look like preprocessed C/C++
            ;;
        esac
    done
}
CC1WRAP=my_cc1

# delegate to the generic wrapper -- we've set WRAPPER so it won't re-source the funcs
. ${TOOLSUB}/wrapper/bin/wrapper

It works by defining a shell function, here called my_cc1, which is declared to the main wrapper logic via the CC1WRAP variable. The main wrapper calls this whenever the compiler wants to compile one file to assembly code, after preprocessing. The shell function's arguments are literally the command line that the compiler wants to run, and we run exactly that command. But immediately after that, we also run our OCaml tool on the input file (preprocessed C), then scrape the relevant section names from the output file (assembly code), and check the priorities match.

The tool works! I can drop it into the Makefile of the project I'm using it in, by adding

CFLAGS += -no-integrated-cpp -wrapper /path/to/wrapper

and hey presto, my compilation job complains that my priorities are getting dropped, forcing me to fix the bug. I'm providing the compiler a wrapper script, but most of the wrapper logic is generic stuff provided by toolsub; I only had to write the shell function shown above, and the OCaml program I mentioned, to implement my tool.

So, toolsub's core wrapper logic is pretty useful, but there are at least a couple of things I'd like to fix with it. One is that wrappers should not be dealing in terms of the cc1 command, which is a compiler-internal thing. It'd be better if the core wrapper could translate such commands into “documented” interfaces like the compiler's top-level cc command line. It already does this for preprocessing, translating from a cc1 to a cpp command line (er, mostly). For the assembler it's also handled, trivially since the driver runs this directly. It's not yet done for linking, nor for the compiler proper... the shell function above still gets given a cc1 command, albeit assisted by some other logic that has pre-extracted the input and output filenames. That logic itself remains a bit rough.

The other thing I could fix is that arguably, this wrapper logic should not be using shell scripts. They're neither fast nor pleasant. But actually... it's hard to beat them for simple things... I hate how clunky subprocesses quickly become in Python, for example. In the above example, we call out to an OCaml tool to do the analysis of C source. Scraping the assembler file using regex hackery doesn't seem so bad, since the assembler is line-oriented.

My goal with toolsub is to have it become an umbrella project for small utilities, like the wrapper being used here, that help people modify their toolchain's behaviour in ways that are easy to “drop in” to existing builds. Think “prototyping and personal use”, not production; research prototypes are my particular motivation. The scripts are there to help avoid getting bogged down in compiler-specific things, such as compiler internals or details of the command line. Another of the tools in there is cilpp which again uses CIL, this time packaged to enable CIL passes to be invoked as “preprocessor plugins” (using something like -Wp,-fplugin,/path/to/cil/pass.cmxs). Injecting these as arguments is easier than using an alternative driver like cilly.

A final tool I want to mention is cccppp, unfortunately just a proof-of-concept for now, written in the very fun five days I spent at the Recurse Center in April last year. It was motivated by the lack of anything comparable to CIL for C++. I want to prototype tools that instrument C++, but I don't want my prototypes to get caught up in the fast churn of LLVM and Clang internals. So the idea is to provide a simple, fairly stable tool that can “virtualize” C++ code by lifting all the built-in operators, like pointer dereference or array subscripting, into templates that the tool invoker controls. By default, these templates just do the normal thing. But, for example, to instrument array accesses, the invoker can supply a custom definition for the built-in array access operator (as a template specialization); I'm using this to provide bounds checking. The key idea is to work using only source-level rewrites of the C++ code... a relatively stable feature of the Clang tooling API. And it's a source-to-source pass, so as long as the tool can parse your code, you can run any compiler on what comes out. Expressing these source-to-source rewrites using Clang's API is, however, pretty nasty. It still needs more work before it can work on large programs; I'm hoping it won't need loads more.

What's the lesson here? One could still argue that I should just get this problem fixed in the compilers I care about, rather than writing hacky wrapper scripts. But I tend to think the opposite: during development, every systems codebase should be compiled via a hacky wrapper script! It's a good way to add custom sanity checks for properties that your project cares about, even if most others don't. It suits properties that are quick to check, are reasonably syntactic or local to specific code (at least to specific files or functions), justify a “fail fast” error report (e.g. because it's critical to correctness), and therefore are better expressed as a static check than in a test case. Link-time invariants are another thing that liballocs could use some checking of—though sadly I haven't implemented wrapper support for the linker yet. Some other properties, like naming conventions or checking for ABI stability, arguably fit this category, depending on how strict you want to be. Anyway, we'll see if I end up using this facility more often.

After subverting your toolchain, you may also want to subvert your runtime. I'll write more about that in future posts.

[/devel] permanent link contact

Tue, 14 Feb 2017

Custom ELF program headers—what, why and how

A couple of times recently I've found myself wanting to create ELF files with custom program headers. This post explains what that means, why you might want it, why there's a practical difficulty in doing so using a standard linker, and my slightly hacky way of solving it.

Program headers are metadata within ELF files. They describe a contiguous chunk of the file's address space. There are several different kinds of program header, of which the most critical is LOAD, signifying that the relevant part of the file should be loaded (or mapped) into memory, forming a segment within the program's memory image. Besides LOAD, other program headers are used to attach particular meanings to specific ranges within the binary. For example, the DYNAMIC program header tells the loader where the dynamic-linking metadata resides. The INTERP program header identifies to the (in-kernel) loader a string naming the program interpreter (the dynamic linker, which will actually do the loading proper). Some vendor extensions exist for adding features within the loading process, such as GNU_RELRO (which records a range that can be made read-only after relocation, bringing security benefits) or GNU_STACK (which does not identify an associated range, but exists as a flag to tell the loader whether the stack needs to be made executable). If you run readelf -l on some ELF binary, you'll see a dump of its program headers. On my machine I see something like this.

$ readelf -Wl /bin/true
Elf file type is EXEC (Executable file)
Entry point 0x401432
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0058b4 0x0058b4 R E 0x200000
  LOAD           0x005e10 0x0000000000605e10 0x0000000000605e10 0x000404 0x0005f0 RW  0x200000
  DYNAMIC        0x005e28 0x0000000000605e28 0x0000000000605e28 0x0001d0 0x0001d0 RW  0x8
  NOTE           0x000254 0x0000000000400254 0x0000000000400254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x004b4c 0x0000000000404b4c 0x0000000000404b4c 0x00023c 0x00023c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x005e10 0x0000000000605e10 0x0000000000605e10 0x0001f0 0x0001f0 R   0x1

Two reasons for wanting to add custom program headers are to add custom features to the loading process, or to make new uses of features it already provides. My first use case was an instance of the latter: to create a heap area at load time, as an additional segment, rather than the usual way of making explicit syscalls during execution. I was building ELF files for the ppcmem emulated execution environment. This doesn't implement any system calls, but does implement an ELF loader. By adding an extra LOAD program header providing an allocation arena, I could write a malloc() that does not require system calls—its heap arena is mapped at start-up by the ELF loader inside ppcmem. (Although I could have allocated space using a big char array, or by hacking the linker script to leave a big gap, a separate segment is logically cleaner, and a disjoint address range for heap allocations is helpful.)
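As a sketch of how little the resulting malloc() needs to do, here's roughly the shape of a bump allocator over such an arena. A static array stands in for the loader-mapped segment so that the sketch is self-contained, and the names are mine, not ppcmem's or my actual allocator's.

```c
#include <stddef.h>

/* A syscall-free allocator: all the space it will ever hand out is
   already mapped before the program starts. In the real setup the
   arena is an extra LOAD segment placed by the ELF loader; here a
   static array plays that role. */

#define ARENA_SIZE (1 << 20)
static unsigned char arena[ARENA_SIZE];
static size_t arena_used;

void *arena_malloc(size_t sz)
{
    /* keep 16-byte alignment, as malloc is conventionally expected to */
    size_t start = (arena_used + 15) & ~(size_t) 15;
    if (sz > ARENA_SIZE - start) return NULL;   /* arena exhausted */
    arena_used = start + sz;
    return arena + start;
}
```

A disjoint, dedicated address range like this also makes it trivial to answer "is this pointer a heap pointer?", which is part of why a separate segment is logically cleaner than a big char array in .bss.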

The second use case for custom program headers comes from liballocs. When a binary is built with the liballocs toolchain, the build also generates extra metadata binaries which can optionally be loaded at run time to supply extra metadata similar to debugging information. Currently these files live outside the main binary, in a separate /usr/lib/allocsites hierarchy, which is a pain for deployment: if you move or copy an executable, you must move or copy its metadata as an extra step. I'd like to bundle the metadata into the binary somehow, and record its presence using a custom program header.

The way to specify the program headers is in the linker script. The problem with creating custom program headers is that a GNU-style linker's script language requires you to specify the entire set of program headers if you specify any program header at all. You can't just add one extra. Worse, the default set of headers is not even made explicit anywhere. The PHDRS command is what specifies the program headers, but the default linker script (try ld --verbose to see it) doesn't use it. Rather, it falls back on the linker's hard-coded default behaviour, which is to create all the familiar headers that we see above.

If we write our own PHDRS command we can create all the program headers we want—but we have to create them all ourselves, meaning manually creating all the ones that the linker would otherwise create for us. The set of headers created by hard-coded logic in the linker has grown over the years, to include things such as the GNU_STACK and GNU_RELRO headers I mentioned earlier. Re-creating this “standard” set of program headers is therefore quite a detailed platform-specific task, and is subject to change as new features are added; a burden we don't want to take on.

So, I came up with a way of getting the effect we want: “please create this extra program header, in addition to all the normal ones”. It is not pretty and not perfect, but it works as follows. Run the linker once, without any custom program header directives. Remember what program headers the linker generated, and what sections it assigned to them. Dump the linker script used during that link, and annotate it so that each output section is labelled with its program header assignment. Then, apply a diff to the linker script adding only the program header we care about. Finally, re-link with the diffed script.

The obvious downside is that we link twice. We're also a bit fragile to the organisation of the default linker script. Still, I'm prepared to pay these prices. I've implemented this approach with a pleasingly nasty (but small) pile of shell scripts, make rules and m4 macros. It now works pretty well for me. Here's a brief tour.

First, we have some makerules to dump the default linker script.

default-dynexe.lds: $(shell which ld)
    $(LD) --verbose 2>&1 |\
      LC_ALL=C\
      sed -e '/^=========/,/^=========/!d;/^=========/d'\
          -e 's/\. = .* + SIZEOF_HEADERS;/& _begin = . - SIZEOF_HEADERS;/'\
      > "$@"

(Thanks to Dan Williams for the sed script, which I took from his page about creating a custom dynamic loader, which is well worth a read.)

Next, we have some makerules to turn such a script into an m4'ised macro-script. Essentially, this macroisation adds two “hooks” into the script: one to include a file describing the “normal” program headers and insert any new ones; and one to wrap each output section in a macro invocation that will decorate it with its corresponding program header names. This macroisation rewrite is easily done with regular expressions.

%-dynexe.lds.phdrify.m4: %-dynexe.lds
    cat "$<" | sed 's#SECTIONS#PHDRS\n{\n\tinclude(phdrs_dynexe.inc)\n\tinclude($(srcdir)/phdrs_extra.inc)\n}\nSECTIONS#' | \
    tr '\n' '\f' | sed -r \
    's@(\.[-a-zA-Z0-9\._]+)[[:space:]\f]+(([^\{]*[[:space:]\f]*)?\{[^\}]*\})@expand_phdr([\1], [\2])@g' | \
    tr '\f' '\n' > "$@"

Next, our phdrified linker script contains output section blocks that used to look like this

  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }

now wrapped in macro invocations, like this.

  expand_phdr([.rodata], [: { *(.rodata .rodata.* .gnu.linkonce.r.*) }])

Once we have a macroised linker script, we need to m4-process it with some macro definitions that turn it back into the script we want. We need two kinds of definition for this. The first is a list of the program headers in the file as it was originally linked. The easiest way is to use readelf to dump the existing ones. As a wart: since (on my machine at least) the linker doesn't understand the full range of symbolic (named) values for the program header's type field, we use the C preprocessor to macro-expand the symbolic names down to hex literals (as defined in elf.h), which it does understand. The second kind of definition is going to expand each output section to explicitly mention which program headers it belongs to. Note that a section can be in more than one program header. I used a bash script with associative arrays, again operating over readelf output, which handily outputs the mapping from segments to section lists. The script inverts this mapping, into a section-to-segment-list mapping. This generates a single m4 macro whose body is a giant if-else chain testing the argument section name against all the sections in the file, and yielding the list of program headers that the section is in. The result is a chunk of m4-ified text that describes the program headers in the original link, looking something like this.

phdr0 6;
phdr1 3;
phdr2 1;
phdr3 1;
phdr4 2;
phdr5 4;
phdr6 0x6474e550;
phdr7 0x6474e551;
phdr8 0x6474e552;

After all that, we then simply m4-include a second file describing the custom ones, and use m4 to expand out the program header assignments. Our output section from before now becomes

  .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) } :phdr2

i.e. recording that the read-only data is in program header named phdr2 (the first LOAD in the dump we saw early on; this is the text segment, which also holds read-only data).

When we re-link with this script, we can add whatever extra linker inputs we want to go into our custom segment, and also apply a diff to the linker script. A simple diff looks like this, where the stuff in sections whose names begin with .insert will end up in a new segment.

--- default-dynexe.lds  2016-09-14 23:12:31.000000000 +0100
+++ insert-dynexe.lds   2016-09-14 23:13:46.000000000 +0100
@@ -187,6 +187,9 @@
   . = ALIGN(64 / 8);
   _end = .; PROVIDE (end = .);
   . = DATA_SEGMENT_END (.);
+  /* srk: new segment! */
+  . = ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1));
+  .insert  : ALIGN(0x1000) { *(SORT_BY_NAME(.insert*)) } :insert
   /* Stabs debugging sections.  */
   .stab          0 : { *(.stab) }
   .stabstr       0 : { *(.stabstr) }

The .insert output section will also be aligned to a 4kB boundary, and be placed in the program header called “insert” by the script. If the input section is allocatable, it will also go into a LOAD header. By varying this patch, you can put whatever section you like in your program header, aligned however you like.

There are some warts in all this. One is that I found increasing the number of program headers would confuse the GNU linker into placing some of its implicitly generated sections, .gnu.build-id and .note.ABI-tag, at a silly offset in the file, and that would mess up the address assignment, by causing the file header no longer to be mapped at virtual address 0. To hack around this, I inserted a one-line sed command which explicitly places the .note.ABI-tag section. Another wart was that the PHDR header was no longer generated properly: it was coming out with an address of 0, not of the executable's start address (0x400000). This turned out to be because no LOAD header was including the region of the file containing the file and program headers themselves. To hack around this, we add a little more rewriting magic to our shell scripts: we add the PHDR and FILEHDR linker attributes to any program header entry with type PT_PHDR and also to the text segment's LOAD program header (usually the first), then update the latter's file offset and memory addresses so that the segment actually spans that part of the file.

I mentioned that I wanted to use custom program headers to embed one ELF file (a metadata object) inside another. How can we do this? Firstly, we embed the metadata ELF file into the enclosing ELF file and mark it with a custom program header. Secondly, at run time, when we want to load the metadata, we locate the program header and get a file descriptor pointing at the embedded content. Then we need a dynamic linker that can dlopen() from a file descriptor; we can hack the existing GNU dynamic linker for this fairly easily (not shown, but I've done it; ask me). Finally though, we need to handle the fact that file offsets in the embedded content will not match file offsets on this file descriptor! So we need to translate all the ELF file offsets in the embedded file by a fixed amount. File offsets live in the ELF header, program headers and section headers. (They might also live in other sections, but I'm not aware of any.) I wrote a simple C program to add a fixed offset to these. There's also some guesswork needed to figure out what that offset should be, since the linker will be in control of the in-file placement of the extra output section. This could be made reliable using one more link pass (bringing the total to three).
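The core of the offset-translation step is small. Here's a sketch of the idea (not the real tool): it assumes a 64-bit ELF image held in memory, bumps the file offsets in the program headers, and then bumps the header's own offset fields; section headers would get the same treatment via the offsets recorded at e_shoff.

```c
#include <elf.h>   /* Elf64_Ehdr, Elf64_Phdr; glibc/Linux header */

/* Add `delta` to every file offset in the ELF header and program
   headers of an in-memory copy of the embedded file, so that they
   become valid offsets within the enclosing file. Virtual addresses
   and sizes are untouched; only file offsets move. */
static void shift_file_offsets(unsigned char *image, Elf64_Off delta)
{
    Elf64_Ehdr *eh = (Elf64_Ehdr *) image;
    /* locate the program headers before we disturb e_phoff */
    Elf64_Phdr *ph = (Elf64_Phdr *)(image + eh->e_phoff);
    for (unsigned i = 0; i < eh->e_phnum; ++i)
        ph[i].p_offset += delta;
    eh->e_phoff += delta;
    eh->e_shoff += delta;
}
```

The same walk would continue over the section header table (and its sh_offset fields) in a full version; the point is just that every field the spec defines as a file offset gets the same fixed delta.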

There's a GitHub repository containing these recipes and examples, here. It's all a bit GNU-specific and underdocumented for now.

All this effort to achieve a fairly simple task: package one ELF object inside another so that the “inner” object is loadable when running the “outer” object. What's another way of looking at this problem? We can think of ELF files as describing an initial state of the program (or, in the case of a library, a fragment of state). What we'd like them to be able to do is specify that this state includes a given file of a particular name and contents. In other words, we'd like them to describe not only aspects of the memory state, but also aspects of the filesystem state. Linkers already do something a bit like this, namely static linking: objects that might not be on the destination filesystem can be bundled into the file, albeit losing their identity as files. Thinking of binaries as “partial initial program states” in this way, there's a clear relationship between ELF executables, Docker images and other collections of memory(-mappable) objects.

Our “embedded ELF file” requirement would be trivial if segments were nameable as files in the first place. This is of course a design feature of Multics, as was dynamic linking. Although the latter has now been picked up by Unix, files and segments have still not been fully re-unified. Maybe it's time? Dynamic linking is one of two key mechanisms, along with filesystems in userspace (whether this or this) that allow us to prototype extensions to Unix without pushing our changes into the kernel. There are usually performance, security or correctness reasons why such extensions are better off ultimately residing at least partly in the kernel. But making a playground entirely in user space is valuable for research and experimentation, and is something I'm working up to. The “embed part of a filesystem inside your binary” idea is one example of the sort of feature that we want to be easily prototyped in such a system. The core features we need—serving the content, and then binding it into the process's namespace—are already provided by the OS, so it's a matter of joining them together.

[/devel] permanent link contact

Mon, 30 Jan 2017

Debugging with the natives, part 2

Some aeons ago, I wrote a piece giving a rough outline of how a native debugger like gdb works, and promised a follow-up that looks at the same pile of techniques in a more principled way. I can't really excuse the delay, but if anyone's still listening, here goes.

Source-level debugging of native code is supported on (I make it) four different levels in a complete hardware–software stack.

We can split these four into two pairs. The first two implement machine-level primitives and their “upward mapping” (virtualisation) into the operating system's process abstraction. The second two implement the source-level view that the programmer usually prefers, again mapping upwards from binary- to source-level.

Let's take the machine-level pair to begin with. From the operating system's point of view, all debugging support is designed around the following principles. (Perhaps I shouldn't say “designed” since in reality they “grew”— and only became principles later.)

Some surprisingly strong properties result from this design. Firstly, debugging can be done from a remote process, perhaps on a separate machine from the debuggee, perhaps even a machine of a different architecture. Secondly, debugging can be done post-mortem. Thirdly, the same infrastructure works for many source languages—albeit trivially so far, since we've only seen how to get an assembly-level view. There are some contrasts here with most language virtual machines (think JVM): these implement debugging using in-VM debug servers. These can work across the network, but don't support post-mortem debugging, and typically bake in concepts from source languages.

That's enough about the machine-level part. To go from machine- or assembly-level debugging to source-level debugging, we need help from the compiler. This is designed around the following principles.

Let's do a more concrete run-through of how it works. So far I've been fairly generic, but let's fix on GNU/Linux as our modern Unix—though all ELF-based systems are pretty similar—and a familiar architecture (x86-64) and specific metadata format (DWARF).

When I compile a program with -g it means “please generate metadata”. First, let's try without.

$ cc -o hello hello.c
$ readelf -S hello | grep debug  # no output! no debugging sections

You can still debug this program, at the assembly level, because the OS debugging mechanisms remain available. It's as if the compiler-generated assembly is code that you wrote manually by yourself. You can set breakpoints, watchpoints, single step, and so on.

$ gdb -q --args ./hello
Reading symbols from ./hello...(no debugging symbols found)...done.
(gdb) break main
Breakpoint 1 at 0x40052a
(gdb) run
Starting program: /tmp/hello 

Breakpoint 1, 0x000000000040052a in main ()
(gdb) disas
Dump of assembler code for function main:
   0x0000000000400526 <+0>:     push   %rbp
   0x0000000000400527 <+1>:     mov    %rsp,%rbp
=> 0x000000000040052a <+4>:     mov    $0x4005c4,%edi
   0x000000000040052f <+9>:     callq  0x400400 <puts@plt>
   0x0000000000400534 <+14>:    mov    $0x0,%eax
   0x0000000000400539 <+19>:    pop    %rbp
   0x000000000040053a <+20>:    retq 

Now let's compile with debug information.

$ cc -g -o hello hello.c
$ readelf -S hello | grep debug
  [28] .debug_aranges    PROGBITS         0000000000000000  00001081
  [29] .debug_info       PROGBITS         0000000000000000  000010b1
  [30] .debug_abbrev     PROGBITS         0000000000000000  00001142
  [31] .debug_line       PROGBITS         0000000000000000  00001186
  [32] .debug_frame      PROGBITS         0000000000000000  000011c8
  [33] .debug_str        PROGBITS         0000000000000000  00001210

What's in these debug sections? There are three main kinds of information. Firstly, there are files and line numbers (.debug_line). These encode a mapping from object code addresses to source coordinates (file, line, column). You can dump it fairly readably, as follows.

$ readelf -wL hello 
Decoded dump of debug contents of section .debug_line:

CU: hello.c:
File name                            Line number    Starting address
hello.c                                        4            0x400526
hello.c                                        5            0x40052a
hello.c                                        6            0x400534
hello.c                                        7            0x400539

Secondly, there is frame information (often this comes out in a section called .eh_frame so I cheated a bit above; to get exactly the above with a current gcc, you should add the -fno-dwarf2-cfi-asm switch). This tells the debugger how to walk the stack. What does walking the stack mean? Roughly it means getting a sequence of stack pointers paired with their program counter values, for each call site that is active on the callchain. The old stack pointer and program counter are always saved somewhere, otherwise we couldn't return from the call. To walk the stack we start from the current “live” register file given to us by ptrace(), which holds the “end” stack pointer and program counter. The DWARF then describes how to “rewind” these register values, and/or any other registers whose job is to record the callchain (rbp on x86-64; other callee-saves are often included too) back to their state at the previous call site in the chain.

The description of this unwinding is logically a table, which you can see below for the main function. The cells are expressions describing how to compute the caller's value for a register, here the rbp value (frame pointer) and also the caller's program counter, i.e. the return address (given as ra). The computations are factored into two parts. Firstly we calculate a “canonical frame address” from the current frame's register values (see the CFA column): it's a fixed offset from rsp or rbp, and is actually a fixed address on the stack, but the expression changes from instruction to instruction as the stack pointer gets adjusted. Secondly we obtain the saved values we want by reading from the stack at fixed offsets from that address (c-8 means 8 bytes down). This factoring helps compactness, because the CFA-relative offsets don't change when the stack pointer moves; only the CFA column needs to describe that.
However, although “stored at some offset from the CFA” covers a lot of cases, sometimes more complex computations are required, which usually appear as DWARF bytecode expressions.

$ readelf -wF hello
(snip)
00000088 000000000000002c 0000001c FDE cie=00000070 pc=0000000000400526..000000000040053b
   LOC           CFA      rbp   ra    
0000000000400526 rsp+8    u     c-8 
0000000000400527 rsp+16   c-16  c-8 
000000000040052a rbp+16   c-16  c-8 
000000000040053a rsp+8    c-16  c-8 
(snip)
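
To make that concrete, here's a sketch of how a debugger might use one row of the table above to unwind a single frame. The struct is my own simplification, not a real DWARF API: I assume a register-plus-offset CFA and CFA-relative saved slots, which is all the table above needs (real frame information can use arbitrary expressions).

```c
#include <stdint.h>

/* One row of the FDE table above, simplified: the CFA is a register plus
   an offset, and the saved rbp and return address live at fixed offsets
   from the CFA. All names here are illustrative, not real DWARF API. */
struct cfa_row {
    uint64_t loc;      /* first pc this row applies to */
    int cfa_is_rbp;    /* 1: CFA = rbp + cfa_off; 0: CFA = rsp + cfa_off */
    int64_t cfa_off;
    int64_t rbp_off;   /* c-16 in the dump => -16; INT64_MIN stands for 'u' */
    int64_t ra_off;    /* c-8 => -8 */
};

/* Unwind one frame: given the current rsp/rbp and the row covering the
   current pc, recover the caller's pc and rbp by reading the stack
   (here, the tracer's own memory, for demonstration purposes). */
static void unwind_one(const struct cfa_row *row,
                       uint64_t rsp, uint64_t rbp,
                       uint64_t *caller_pc, uint64_t *caller_rbp,
                       uint64_t *cfa_out)
{
    uint64_t cfa = (row->cfa_is_rbp ? rbp : rsp) + row->cfa_off;
    *caller_pc = *(const uint64_t *)(uintptr_t)(cfa + row->ra_off);
    if (row->rbp_off != INT64_MIN)
        *caller_rbp = *(const uint64_t *)(uintptr_t)(cfa + row->rbp_off);
    else
        *caller_rbp = rbp;  /* 'u': the caller's rbp is still live */
    *cfa_out = cfa;
}
```

A debugger would of course read the stack words via ptrace() rather than plain loads, and would loop this step until the table (or a sentinel value) says to stop.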

The .debug_info section is the biggie. It describes the structural detail of the source program along both source and binary dimensions. It has a list of source files, but also a list of compilation units. The latter is where most of the structure is. It describes functions/methods, data types, and all the language-implementation decisions that the compiler took when generating binary code: how data types are laid out, which registers or stack slots hold each local variable over its lifetime, and so on. Although not shown much in the simple case below, addresses of program variables are described in a Turing-powerful stack machine language which is essentially a bytecode; the DW_OP_call_frame_cfa below is one operation, which simply says “push the address of the frame base, as recorded by the frame info”. The tree-like structure of the information also describes detailed static structure of code, including function inlining, the in-memory locations corresponding to particular lexical blocks in the code, and so on. (It's worth asking whether DWARF info might usefully bundle the source code itself. I've never seen this done, but it would make a lot of sense to me.)

$ readelf -wi hello
Contents of the .debug_info section:

  Compilation Unit @ offset 0x0:
   Length:        0x8d (32-bit)
   Version:       4
   Abbrev Offset: 0x0
   Pointer Size:  8
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <c>   DW_AT_producer    : (indirect string, offset: 0x2f): GNU C 4.9.2 -mtune=generic -march=x86-64 -g -fno-dwarf2-cfi-asm -fstack-protector-strong
    <10>   DW_AT_language    : 1        (ANSI C)
    <11>   DW_AT_name        : (indirect string, offset: 0x88): hello.c
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0xb5): /tmp
    <19>   DW_AT_low_pc      : 0x400526
    <21>   DW_AT_high_pc     : 0x15
    <29>   DW_AT_stmt_list   : 0x0
 <1><2d>: Abbrev Number: 2 (DW_TAG_base_type)
    <2e>   DW_AT_byte_size   : 8
    <2f>   DW_AT_encoding    : 7        (unsigned)
    <30>   DW_AT_name        : (indirect string, offset: 0x0): long unsigned int
 <1><34>: Abbrev Number: 2 (DW_TAG_base_type)
    <35>   DW_AT_byte_size   : 1
    <36>   DW_AT_encoding    : 8        (unsigned char)
    <37>   DW_AT_name        : (indirect string, offset: 0xa2): unsigned char
 <1><3b>: Abbrev Number: 2 (DW_TAG_base_type)
    <3c>   DW_AT_byte_size   : 2
    <3d>   DW_AT_encoding    : 7        (unsigned)
    <3e>   DW_AT_name        : (indirect string, offset: 0x12): short unsigned int
 <1><42>: Abbrev Number: 2 (DW_TAG_base_type)
    <43>   DW_AT_byte_size   : 4
    <44>   DW_AT_encoding    : 7        (unsigned)
    <45>   DW_AT_name        : (indirect string, offset: 0x5): unsigned int
 <1><49>: Abbrev Number: 2 (DW_TAG_base_type)
    <4a>   DW_AT_byte_size   : 1
    <4b>   DW_AT_encoding    : 6        (signed char)
    <4c>   DW_AT_name        : (indirect string, offset: 0xa4): signed char
 <1><50>: Abbrev Number: 2 (DW_TAG_base_type)
    <51>   DW_AT_byte_size   : 2
    <52>   DW_AT_encoding    : 5        (signed)
    <53>   DW_AT_name        : (indirect string, offset: 0x25): short int
 <1><57>: Abbrev Number: 3 (DW_TAG_base_type)
    <58>   DW_AT_byte_size   : 4
    <59>   DW_AT_encoding    : 5        (signed)
    <5a>   DW_AT_name        : int
 <1><5e>: Abbrev Number: 2 (DW_TAG_base_type)
    <5f>   DW_AT_byte_size   : 8
    <60>   DW_AT_encoding    : 5        (signed)
    <61>   DW_AT_name        : (indirect string, offset: 0xb0): long int
 <1><65>: Abbrev Number: 2 (DW_TAG_base_type)
    <66>   DW_AT_byte_size   : 8
    <67>   DW_AT_encoding    : 7        (unsigned)
    <68>   DW_AT_name        : (indirect string, offset: 0xb9): sizetype
 <1><6c>: Abbrev Number: 2 (DW_TAG_base_type)
    <6d>   DW_AT_byte_size   : 1
    <6e>   DW_AT_encoding    : 6        (signed char)
    <6f>   DW_AT_name        : (indirect string, offset: 0xab): char
 <1><73>: Abbrev Number: 4 (DW_TAG_subprogram)
    <74>   DW_AT_external    : 1
    <74>   DW_AT_name        : (indirect string, offset: 0xc2): main
    <78>   DW_AT_decl_file   : 1
    <79>   DW_AT_decl_line   : 3
    <7a>   DW_AT_prototyped  : 1
    <7a>   DW_AT_type        : <0x57>
    <7e>   DW_AT_low_pc      : 0x400526
    <86>   DW_AT_high_pc     : 0x15
    <8e>   DW_AT_frame_base  : 1 byte block: 9c         (DW_OP_call_frame_cfa)
    <90>   DW_AT_GNU_all_tail_call_sites: 1
 <1><90>: Abbrev Number: 0
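
Since the frame base above is just DW_OP_call_frame_cfa, here's a toy evaluator for a two-opcode subset of the expression language. The opcode values are the real ones from the DWARF spec; everything else is an illustrative simplification, including the assumption that the frame base equals the CFA (true here, per DW_AT_frame_base, but not in general).

```c
#include <stdint.h>
#include <stddef.h>

/* Opcode values from the DWARF spec. */
enum { DW_OP_fbreg = 0x91, DW_OP_call_frame_cfa = 0x9c };

/* Decode a signed LEB128 value; advance *p past it. */
static int64_t sleb128(const uint8_t **p)
{
    int64_t result = 0; unsigned shift = 0; uint8_t byte;
    do {
        byte = *(*p)++;
        result |= (int64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    if (shift < 64 && (byte & 0x40)) result |= -((int64_t)1 << shift);
    return result;
}

/* Evaluate a tiny subset of DWARF location expressions. 'cfa' plays the
   role the frame info supplies. Returns the value left on top of the
   expression stack, or 0 on an opcode this sketch doesn't handle. */
static uint64_t eval_loc_expr(const uint8_t *expr, size_t len, uint64_t cfa)
{
    uint64_t stack[16]; int top = -1;
    const uint8_t *p = expr, *end = expr + len;
    while (p < end) {
        uint8_t op = *p++;
        switch (op) {
        case DW_OP_call_frame_cfa:            /* push the CFA */
            stack[++top] = cfa; break;
        case DW_OP_fbreg:                     /* frame base + sleb128 offset */
            stack[++top] = cfa + sleb128(&p); break;
        default:
            return 0;                         /* unhandled in this sketch */
        }
    }
    return top >= 0 ? stack[top] : 0;
}
```

A real evaluator handles well over a hundred opcodes, including memory dereferences and register reads, which is why it needs access to the debuggee's state.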

That's it for the tour. Let me finish with some reflections on what's good and bad about this way of doing things.

Descriptive debugging: the good

Why do debugging this convoluted, long-winded way, with all this metadata, instead of the apparently simpler VM way of doing things? In VMs, debug servers are integrated into the runtime, and offer a fixed, high-level command interface. This allows the VM's compile-time implementation decisions to stay hidden. There is no need to tell the debugger how storage is laid out, where local variables live, and so on, because the command interface is pitched at a higher level of abstraction; those details remain internal to the VM. This is convenient for the VM implementer, since generation of debugging information is onerous to implement. But it also requires cooperation from the debuggee, couples debuggers to a fixed command language or wire protocol, and presents strictly less information to the developer. While VM debuggers are designed around the abstraction boundaries of the source languages, metadata-based debugging actively enables descending through these boundaries. It is sometimes very useful for debugging tools to expose implementation details in this way. The most obvious case is when faced with a compiler or VM bug; the user would like to “shift down” to the lower level to inspect the assembly code or VM state. At other times, there are performance bugs that the developer has a hunch are about cache or paging effects; being able to see the raw addresses and raw memory contents can help here, even when the program is running on a VM.

Being highly descriptive, debugging metadata documents a large number of implementation decisions taken by the compiler, so is useful not only to debuggers but also to profilers, language runtimes (C++ exception handling is usually built on DWARF frame information), other dynamic analysis tools such as the Valgrind family, and so on.

Debugging optimised code (without deoptimising)

Debugging metadata must describe optimised code. By contrast, VM debug servers typically arrange that debug-server operations only need to deal with unoptimised stack frames and at most simply-generated code (e.g. from a template-based JIT). Confusingly, even the “full-speed debugging” feature of HotSpot uses dynamic deoptimisation to get back to unoptimised code—the earlier approach was to run the whole program under the interpreter whenever you wanted a debuggable execution. In general, a debuggable VM instance must either refrain from optimisation, or know how to dynamically undo that optimisation when a debugger is attached. So, dynamic deoptimisation is not exactly “full speed”—unlike with native debuggers, execution still slows down significantly when a debugger is attached. Having the VM implement debug operations only over unoptimised code is a restriction that helps make the debug server simple, at some cost in debug-time performance.

The flip side is that VM debugging is pretty good at precisely maintaining the source-level abstraction (modulo VM bugs), without complicating the task of implementing optimisations. Meanwhile, in Unix-land, the debugging experience remains best-effort and only as good as the compiler-generated metadata, which is sometimes wrong or incomplete following complex transformations. When optimisation and debuggability are in tension, debuggability usually takes the hit, so a smooth debugging experience still sometimes relies on a separate unoptimised “debug build”. Tail call optimisations are a classic debugging-hindering optimisation, since they rely on eliding stack frames, meaning the debugger cannot know how many recursive calls are logically on the (source-level) stack. Instruction scheduling is another: the order of operations in the executed code need not match source program order, and this can make for debug-time confusion.

The control problem

Some other properties are incidental but tend to be true of current Unix-style debugging. Debugger support for exercising the target language's control features (exceptions, threads, allocations, ...) is uneven, because this can't be described purely as metadata; there needs to be some interaction with the host runtime. DWARF and similar debugging information is good at the “statics” of describing how to decode the program state, but not good at describing the protocols of interaction with a language runtime, necessary for performing operations such as spawning a thread, allocating an object or throwing an exception. These tend to be difficult unless they happen to be cleanly exposed as entry points in the language runtime. In practice debuggers usually achieve these things by having magic knowledge of particular implementations.

At least one semi-portable interface has emerged with the aim of encapsulating run-time control operations for debuggers' benefit. I'm thinking of libthread_db, best described by Joe Damato's excellent article. Unfortunately it's an abomination, because it violates the principle that implementation details are described by architecture-independent metadata. An odd but cleaner and more consistent alternative would be to bundle snippets of DWARF bytecode for doing these runtime interactions—perhaps in the debug information of a language runtime, either simply calling into the runtime (for cleanly abstracted operations) or doing something more complex. But that is only a technical possibility; there are no proposals or working demos of that as far as I'm aware (maybe I'll make one). This might sound wacky, but if you know about the early history of Java, in Oak and the Green Project, you'll see a certain uncanny similarity in these ideas.

Levels of abstraction

Debugging at multiple levels of abstraction is a neat facility of Unix-style debugging, but is also a difficult power to control. It can be useful to switch down to the assembly-level view, or to switch languages, but this capability doesn't generalise to the case where many abstraction layers are built up within the same language (think C++). The debugger will let you “continue to the next source line”, but it doesn't know how to keep you at the same abstraction level. If the next source line is deep inside a library implementing something fairly basic like a smart pointer, it will skip only as far as that line, whereas you probably wanted to stay roughly at the same level of abstraction, or perhaps within the same codebase. Things get particularly bad when there is a lot of inlining (again with C++). The traditional “step over” and “step into” are hints at this need, but are too crude.

Doing better is currently beyond the debugger's ken, but this problem could be solved: perhaps by bringing in the knowledge of libraries and source file trees that the debugger already has, or perhaps most simply by allowing programmers to manually mark out the boundaries between layers. This could be a simple partitioning over source files and binary objects, or could be something more complex, perhaps sensitive to calling context or argument values (consider the case of the same library used from two places in the same application). “Next”-style operations could then be defined in terms of these layers. I'd love to see this, although the details would take a lot of working out.

To be continued?

There's plenty of research to be done on more debuggable language implementations. This research area doesn't seem to get the attention it deserves. One problem is that debugging is usually an afterthought for researchers, even though it is essential for programmers. Another problem is that not many PL researchers have sight of the design space; they're familiar with either the Unix-style approach or the VM-style one. I hope that in the future we can figure out how to get the best of both worlds.

[/devel] permanent link contact

Thu, 25 Feb 2016

Debugging with the natives, part 1

By popular request, I'm going to run a couple of articles explaining how native Unix debuggers like gdb or lldb work. In this one, we'll take a quick cross-section of what's going on. I'll say “gdb” but I could be talking about any similar debugger.

Starting things up

When you start a program in a debugger like gdb, it does a variation of the usual fork()-then-exec() trick, in which it sets itself up as a tracer of the child process before it exec()s the program you want to run. Alternatively, the same API can attach to a process already running. Either way, the effect is that any signals received by the “tracee” process will be delivered first to the debugger, and any system calls made by the tracee will also generate a signal. These signals can be inspected, discarded or modified as the debugger wishes, using the ptrace() API. In effect, the process (tracee) and the debugger (tracer) execute by mutual hand-off, suspending and resuming each other.
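
In case you haven't seen it, the basic dance looks something like the following sketch (error handling mostly omitted, and a real debugger does considerably more at each stop than just continuing).

```c
#define _GNU_SOURCE
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a program under tracing, continuing it at every stop until it
   exits. Returns the child's exit status, or -1 on error. */
static int trace_program(char *const argv[])
{
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* route our signals to the tracer */
        execvp(argv[0], argv);                 /* stops with SIGTRAP on success */
        _exit(127);                            /* exec failed */
    }
    if (child < 0) return -1;
    int status;
    for (;;) {
        if (waitpid(child, &status, 0) < 0) return -1;
        if (WIFEXITED(status)) return WEXITSTATUS(status);
        /* The child is stopped (the first stop happens at the exec). A
           real debugger would inspect registers and memory here; we
           just let it continue. */
        ptrace(PTRACE_CONT, child, NULL, NULL);
    }
}
```

To attach to an already-running process instead, the tracer calls ptrace(PTRACE_ATTACH, pid, ...) and waits for the resulting stop.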

Walking the stack

Using the same API, gdb can access the raw memory image of the debugged program, and its register file. (Note that when gdb is active, the program is suspended, so its registers are saved in memory.) It can use this to walk the stack and print a backtrace: the register file gives us the program counter and stack pointer. If we have a nice simple stack which saves frame pointers, it simply walks the chain in memory. To print symbolic function names, it can use either the symbol table (if we must) or the compiler-generated debugging information (better; more on that in a moment).
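
Here's a sketch of such a frame-pointer walk. I've abstracted the memory access behind a callback, so the same logic works whether we read via ptrace(PTRACE_PEEKDATA, ...) or, as in the in-process reader included below, directly.

```c
#include <stdint.h>
#include <stddef.h>

/* Reader abstracting over ptrace(PTRACE_PEEKDATA, ...) or plain loads. */
typedef uint64_t (*read_word_fn)(uint64_t addr, void *ctx);

/* Classic x86-64 frame-pointer walk: each frame stores the caller's
   saved rbp at rbp, and the return address at rbp+8. Fills pcs[] with
   up to max return addresses. */
static size_t walk_stack(uint64_t rbp, read_word_fn read_word, void *ctx,
                         uint64_t *pcs, size_t max)
{
    size_t n = 0;
    while (rbp != 0 && n < max) {
        pcs[n++] = read_word(rbp + 8, ctx);  /* return address */
        rbp = read_word(rbp, ctx);           /* caller's saved rbp */
    }
    return n;
}

/* Direct in-process reader; a debugger would substitute one that calls
   ptrace(PTRACE_PEEKDATA, pid, addr, 0) instead. */
static uint64_t read_in_process(uint64_t addr, void *ctx)
{
    (void)ctx;
    return *(const uint64_t *)(uintptr_t)addr;
}
```

When frame pointers aren't saved, this simple chain-following doesn't work, and the debugger must fall back on the frame information described earlier.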

Reading local state

To print a local variable, the debugger uses the same compiler-generated debugging information to look up where that variable is currently located. This lookup is keyed on the program's current program counter, to handle the way that locals get moved around between registers and the stack from instruction to instruction. The result of the lookup might be a memory location, a register number, or even, sometimes, something composite like “first 8 bytes in register X, rest in memory location Y”.
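
A much-simplified sketch of such a lookup might look like this. The structures are illustrative stand-ins for DWARF location lists, not any real API; in particular I've collapsed the "composite" case.

```c
#include <stdint.h>
#include <stddef.h>

/* Where a variable lives over a range of pc values -- a simplified
   stand-in for a DWARF location list entry. */
enum loc_kind { LOC_REG, LOC_MEM_CFA_REL };
struct var_loc {
    uint64_t pc_lo, pc_hi;   /* half-open range of validity */
    enum loc_kind kind;
    int64_t value;           /* register number, or offset from the CFA */
};

/* Find the location entry covering pc, or NULL if the variable has no
   location there (e.g. it was optimised away). */
static const struct var_loc *lookup_loc(const struct var_loc *locs, size_t n,
                                        uint64_t pc)
{
    for (size_t i = 0; i < n; i++)
        if (pc >= locs[i].pc_lo && pc < locs[i].pc_hi)
            return &locs[i];
    return NULL;
}
```

The NULL case is real and common: in optimised code, a variable may simply have no location at the current pc, which is when gdb says "optimized out".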

Setting breakpoints

To set a breakpoint on a particular source line, gdb uses a different part of the debugging information which encodes a mapping between binary locations (program counter values) and source file/line/column coordinates. A small subset of program counter values are specially identified as “beginning of statement” instructions; these are normally the ones the user wants to break on. Sometimes, hardware debug registers can be used to implement breakpoints (up to some maximum); the CPU generates a trap on reaching the program counter value stored in the register. More classically, the “soft” method is for the debugger to overwrite the instructions with trap instructions, saving the old instruction. This trap will cause a signal to be generated when the program executes it.
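
On x86 the trap instruction is the one-byte int3 (0xCC). The pure core of planting one is just patching the low byte of a word read with ptrace(PTRACE_PEEKTEXT, ...) before writing it back with PTRACE_POKETEXT; something like:

```c
#include <stdint.h>

/* The x86 breakpoint instruction: int3. */
#define INT3 0xCCULL

/* Replace the low byte of an instruction word with int3. A debugger
   reads the word at the breakpoint address with PTRACE_PEEKTEXT, saves
   it, and writes back the patched word with PTRACE_POKETEXT. */
static uint64_t plant_int3(uint64_t orig_word)
{
    return (orig_word & ~0xffULL) | INT3;
}
```

The saved original word is what makes resuming possible later, as described next.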

Resuming from breakpoints

To resume from a software breakpoint, a bit of juggling is required. Recall that we overwrote the original instruction. We can replace it, then ask the OS to single-step the program (using the hardware's single-step mode), then re-set the trap. This is racy in multithreaded programs when the other threads aren't stopped. So instead a good debugger will emulate the instruction itself! This means using an internal interpreter for the target instruction set, backed by the memory and register file access that it already has via ptrace().

Conditional breakpoints and expression evaluation

If we have a conditional breakpoint, we also need to be able to evaluate expressions when we take the trap (and silently resume if the expression is false). For this, the debugger has a parser and interpreter for each source language it supports. These interpreters are very different from an ordinary interpreter: the expression's environment comes from the binary contents of the debugged program's image, rather than being a structure the interpreter itself maintains. (The interpreter keeps a small amount of local state, such as for subexpression results, but it's mostly in the debuggee.)

Calling functions

To evaluate expressions containing function calls, we need to be able to run code in the debuggee. To do so, we craft a special frame on the debuggee's stack, a safe distance away from its actual top-of-stack. Having evaluated the arguments as usual, we put them in the relevant registers (saving what was there), tweak the saved stack pointer to point to this crafted state, set the instruction pointer to the callee, and set the debuggee going again. The function runs as normal, and on return it uses an on-stack return address that the debugger supplied. This points at a breakpoint-like instruction that raises a signal, returning control to the debugger. This then cleans up the stack as if nothing happened, and resets the registers accordingly. (Disclaimer: I haven't actually dug through gdb's code here, so I might have one or two things subtly wrong. At least, the above is how I'd implement it. Let me know if you know better.)

It goes on...

There's lots more, but if you understand the above, you've probably got the gist.

All that seems like a big bag of tricks. What are the principles at work in this design? Although the system has “grown” rather than being “designed”, and has a generous share of quirks and unstandardised corners, there are some surprisingly strong principles lurking in it. I've written a little about these in my paper at Onward! last year. Next time, I'll cover briefly some of the same observations, and a few others, hopefully thereby re-explaining how debuggers work, a bit more systematically.

[/devel] permanent link contact

Thu, 16 Jul 2015

ELF introspection, robustly and portably

I often find myself writing code that does introspection at the ELF level. This is for things like looking up symbols, walking the link map, figuring out which shared object a particular memory address belongs to, which memory regions are executable, and so on. These are fairly common things to want to do in ptrace()- or LD_PRELOAD-based tools, or binary interpreter-like (Valgrind-style) ones. This post is about how to do these things robustly from within a process, where “robustly” brings some particular restrictions.

What introspection?

What does our address space look like? To spare myself the pain of drawing diagrams, we can ask Linux to dump a somewhat-pictorial view of a simple cat process by doing cat /proc/self/maps. I have reversed the order of the lines so that higher addresses go higher up, and tweaked the formatting slightly.

          address range           flgs  offset   dev   inode                  pathname or other name   
------------------------------------------------------------------------------------------------------------
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

    7fff301fe000-    7fff30200000 r-xp 00000000 00:00 0                  [vdso]

    7fff301a5000-    7fff301c7000 rw-p 00000000 00:00 0                  [stack]

    7f4a574bc000-    7f4a574bd000 rw-p 00000000 00:00 0 
    7f4a574bb000-    7f4a574bc000 rw-p 00023000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a574ba000-    7f4a574bb000 r--p 00022000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57493000-    7f4a574ba000 rw-p 00000000 00:00 0 

    7f4a57298000-    7f4a572bb000 r-xp 00000000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57293000-    7f4a57298000 rw-p 00000000 00:00 0 
    7f4a57291000-    7f4a57293000 rw-p 001bf000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5728d000-    7f4a57291000 r--p 001bb000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5708e000-    7f4a5728d000 ---p 001bc000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56ed2000-    7f4a5708e000 r-xp 00000000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56c09000-    7f4a56ed2000 r--p 00000000 08:05 2750795            /usr/lib/locale/locale-archive
        00838000-        00859000 rw-p 00000000 00:00 0                  [heap]
        0060c000-        0060d000 rw-p 0000c000 08:01 286233             /bin/cat
        0060b000-        0060c000 r--p 0000b000 08:01 286233             /bin/cat
        00400000-        0040c000 r-xp 00000000 08:01 286233             /bin/cat

This picture is a refinement of the usual OS textbook diagram of a Unix address space. As usual, we have the executable's text segment (bottom), read-only and then writable data (next two up), heap (next up), a stack (much higher) and the kernel region (up top). (Actually the kernel region is not shown, except for one small part, the vsyscall page, which is the only user-accessible area.) The biggest difference is that we have not only the executable but some libraries—primarily the C library (since cat doesn't use others), but also the dynamic linker (itself a library) and the mysterious vdso. All these are ELF objects. In general, a large subset of a process's structure is actually defined by the collection of loaded ELF objects and a little per-process structure also defined by ELF. Our problem is now: how can we acquire a view on this ELF-level structure, from within the program itself at run time, efficiently and robustly?

(The above listing also shows a few other flavours of mapping that won't concern us here: a memory-mapped file /usr/lib/locale/locale-archive, and two anonymous mappings, beginning at 0x7f4a57493000 and 0x7f4a57293000, which are almost certainly heap areas—we can have many heaps, not just the one that grows out of the executable's data segment. And of course, although we see only one mapping labelled [stack], in a large multithreaded process there will be many such.)
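
For the curious, here's a quick sketch of parsing such maps lines in C. The field widths and the struct are my own choices, not anything canonical, and of course reading /proc at all is exactly the kind of file I/O we're about to try to avoid.

```c
#include <stdio.h>
#include <string.h>

struct mapping {
    unsigned long long start, end;  /* address range */
    char perms[5];                  /* e.g. "r-xp" */
    char path[4096];                /* pathname or pseudo-name, may be empty */
};

/* Parse one line of /proc/<pid>/maps. Returns 1 on success, 0 on a
   malformed line. Anonymous mappings leave path empty. */
static int parse_maps_line(const char *line, struct mapping *m)
{
    unsigned long long off, inode;
    unsigned dev_major, dev_minor;
    m->path[0] = '\0';
    int n = sscanf(line, "%llx-%llx %4s %llx %x:%x %llu %4095[^\n]",
                   &m->start, &m->end, m->perms, &off,
                   &dev_major, &dev_minor, &inode, m->path);
    return n >= 7;   /* the pathname field is optional */
}
```

Iterating over fopen("/proc/self/maps", "r") with fgets() and this parser gives a usable in-process picture of the address space, at the cost of those system calls.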

What's ELF?

The basics of ELF linking were first defined in System V Release 4 in 1988, and eventually adopted by non-commercial Unices starting in the mid-1990s (1995 for Linux, with kernel 1.2; 1998 for FreeBSD, with version 3; NetBSD and OpenBSD in 2001 and 2003 respectively). The “classic” picture of a loadable ELF file (direct from the spec) looks like the following.

Execution View
ELF Header
Program header table
required
Segment 1
Segment 2
Segment 3
...
Section header table
optional

But actually, that omits some of the most interesting stuff! Let's try again.

Execution View
ELF Header
Program header table
required
Dynamic linking metadata
Dynamic symbol table
Dynamic relocation table
Segment 1
Segment 2
Segment 3
...
Section header table
optional
Symbol table
optional

Dynamic symbols, program headers and the miscellaneous dynamic linking metadata are what we're interested in introspecting on. The POSIX libdl API offers one interface for doing introspection at the ELF level, at least for looking up symbols by name. Non-portable extensions, like SunOS/Solaris's dladdr() and the later-added dladdr1(), together with GNU's dlvsym() and dl_iterate_phdr(), get us a bit further, walking program headers and looking up symbols by address or version. We get further still if we allow ourselves OS-specific features, like Linux's helpful maps file in the /proc filesystem that generated the dump above, together with the auxv file or its associated library call getauxval(). If we're happy to parse the ELF file itself, by doing file I/O, we can get at anything we like.
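
For example, dl_iterate_phdr() lets us enumerate the loaded objects and their segments without any file I/O. Here's a small demonstration; note that it's a GNU extension and a library call, so it won't meet the robustness bar I'm about to set.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>
#include <stdint.h>

/* Callback invoked once per loaded object: print its name and base
   address, and the address range of each PT_LOAD segment. */
static int show_object(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    int *count = data;
    ++*count;
    printf("%s loaded at %p\n",
           info->dlpi_name[0] ? info->dlpi_name : "(main executable)",
           (void *)(uintptr_t)info->dlpi_addr);
    for (int i = 0; i < info->dlpi_phnum; i++)
        if (info->dlpi_phdr[i].p_type == PT_LOAD)
            printf("  segment at %p, size %llu\n",
                   (void *)(uintptr_t)(info->dlpi_addr
                                       + info->dlpi_phdr[i].p_vaddr),
                   (unsigned long long)info->dlpi_phdr[i].p_memsz);
    return 0;  /* keep iterating */
}
```

Calling dl_iterate_phdr(show_object, &count) walks every loaded object, main executable included.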

What's “robustly”?

There are problems with all of the preceding approaches. Portability is lacking from many—only the POSIX APIs and ELF file I/O are widely portable. Robustness is lacking from all of them, in the sense that it may not be safe or reliable to use them from “sensitive contexts” where ELF introspection may be required. The main kind of “sensitive context” is one where making system calls or library calls would cause unwanted reentrancy. For example, if we want access to metadata from instrumentation logic that runs in the middle of a malloc() call, calling out to a library which might itself call malloc() is a bad idea. If we're instrumenting the code paths that make system calls (ask me how!), calling out to the OS might also be a bad idea. These unintended reentrancies can easily cause infinite loops, deadlock or state corruption. There are also security reasons why system calls and library calls are less reliable than working entirely within the process: the filesystem might have changed since the filenames in our maps or link map were recorded, for example.

So, how much ELF introspection can we do without the help of system calls, library calls or non-portable code? The answer surprised me: quite a lot, but not quite everything. There is some low-hanging fruit, then some that we can just about reach (provided a couple of assumptions which, although not strictly portable, probably hold for most platforms), then some that absolutely no portable method can let us get (without resorting to file I/O).

Let's state our problem slightly more precisely. Our address space includes a selection of objects (the executable and all loaded libraries), mapped at particular addresses. We'd first like to enumerate these mappings, giving us a look-up from load addresses to the object's originating binary file (pathname)—more-or-less the structure we saw in the /proc/self/maps listing earlier. Furthermore, each object defines some symbols. We'd like to be able to resolve symbols in each object, and to get at any other useful metadata, like their soname, the libraries they depend on, the locations of support structures like the procedure linkage table (PLT), and the relocations performed when they were loaded. This is what we saw in the elaborated ELF file picture. The symbols-and-addresses stuff is the most important; the relocation and miscellaneous information isn't useful as often, but we'd like to have it anyway. Finally, there's some useful metadata that the operating system has passed down about the process, in an ELF-defined structure called the auxiliary vector. It turns out that we'll use this to get at some of the other metadata, but other parts of it are useful in their own right—they tell us the process ID, page size, and various properties of the hardware.

The link map

To enumerate the mapped objects, we need access to a structure called the link map. The portable interface for walking the link map is the mysterious r_debug protocol. This is surprisingly murky in its origins, but it probably comes from SunOS; it was definitely included in System V release 4 (the only solid reference I know is the relevant Programmer's Guide). It's actually not specified in most ABIs, although some seem to do so. The r_debug protocol defines a structure containing pointers to the link map. From an executable, it's easy to find this structure: we usually have a DT_DEBUG entry in PT_DYNAMIC, and the dynamic linker updates it to point to an r_debug structure it allocates. This contains a pair of pointers to the link map, represented as a doubly-linked list.

What about from a shared library? Getting a pointer to the r_debug is trickier here, because shared libraries don't usually have a DT_DEBUG entry. This is a chicken-and-egg problem: we need the executable's r_debug to get to other objects, but if we're not starting in the executable, how do we get there? So here comes our first somewhat-non-portable assumption: that a global symbol _r_debug is available, and marks the r_debug structure in the dynamic linker. This symbol is commonly present, but I don't believe it's standard—POSIX generally specifies programmer-facing APIs, not linker-level symbols. (As we'll see shortly, another method for getting at it is available, but it is also non-portable and imposes some other requirements. Anyway, by combining these two options, we have a pretty good chance that we can find the link map. So, let's continue on the assumption that we have the link map.)
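
Here's a minimal sketch of the happy path, assuming the _r_debug symbol is available (glibc and friends export it, but, as just noted, it's not guaranteed by any standard). The only library call below is printf, and that's purely for show.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>
#include <stdint.h>

/* The dynamic linker's r_debug structure: a commonly present but not
   strictly standard symbol, as discussed above. */
extern struct r_debug _r_debug;

/* Walk the doubly-linked link map, printing each object's pathname and
   load address, and return how many objects we saw. */
static int walk_link_map(void)
{
    int n = 0;
    for (struct link_map *l = _r_debug.r_map; l; l = l->l_next) {
        printf("%s at base %p\n",
               l->l_name[0] ? l->l_name : "(no name: probably the executable)",
               (void *)(uintptr_t)l->l_addr);
        n++;
    }
    return n;
}
```

Note that l_addr is the load bias (the difference between link-time and run-time addresses), not necessarily the lowest mapped address, which is a common source of confusion.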

Symbols

Once we have the link map, we'd like to do symbol lookups in each object. For this, we need (at least) the dynamic symbol tables, which we can find via the PT_DYNAMIC segments of each object. (We probably also want the symbol hash tables; the method is similar.) The link map helpfully keeps a pointer to this specific segment (rather than to the overall program header table). The segment's content consists of key-value pairs, and it's the pair whose key is DT_SYMTAB that points to the dynamic symbol table. (On most ABIs, a short cut to get a pointer to the containing object's PT_DYNAMIC segment is to make a link-time reference to the symbol _DYNAMIC. But this only lets us get the containing object's PT_DYNAMIC, whereas walking the link map gets us all of them.)
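
Here's a sketch of that scan, using the _DYNAMIC shortcut just mentioned to get a PT_DYNAMIC to try it on. One caveat: I'm assuming the d_ptr values have been relocated to absolute addresses at load time, which is what Linux's dynamic linkers do but isn't true on every platform.

```c
#define _GNU_SOURCE
#include <link.h>
#include <elf.h>
#include <stddef.h>

/* The containing object's own PT_DYNAMIC contents, via the link-time
   shortcut mentioned above. */
extern ElfW(Dyn) _DYNAMIC[];

/* Scan a PT_DYNAMIC segment's key-value pairs for the dynamic symbol
   table and its string table. Assumes d_ptr values were relocated to
   absolute addresses at load time (true on Linux). */
static void find_symtab(const ElfW(Dyn) *dyn,
                        const ElfW(Sym) **symtab, const char **strtab)
{
    *symtab = NULL; *strtab = NULL;
    for (; dyn->d_tag != DT_NULL; dyn++) {
        if (dyn->d_tag == DT_SYMTAB)
            *symtab = (const ElfW(Sym) *)(uintptr_t) dyn->d_un.d_ptr;
        if (dyn->d_tag == DT_STRTAB)
            *strtab = (const char *)(uintptr_t) dyn->d_un.d_ptr;
    }
}
```

The same function works on the PT_DYNAMIC pointer (l_ld) stored in each link map entry, which is how we'd cover all loaded objects rather than just our own.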

One caveat about symbol lookups is that only dynamic symbols are looked up. The other symbols, if there are any (not needed for dynamic linking, but useful for debugging), live in a different table, .symtab, which needn't be mapped at run time. Depending on how your object was linked, perhaps most symbols became dynamic symbols (--export-dynamic) or perhaps only a minimal selection did. If you've ever tried to look up stat() in the GNU C library, using dlsym(), and been surprised not to find it, this is why. The stat() function is one of a few that are always linked statically, even when the rest of the C library is linked dynamically. In general, not all code or data in our dynamic objects is actually described by dynamic symbols. I'll return to this shortly.

The auxiliary vector

The link map gave us a list of loaded objects and their base addresses, but it doesn't tell us where each individual segment is mapped. This is where we begin our voyage into the not-really-portable. In the case of the executable, we can get the program headers via another useful metadata structure: the auxiliary vector. This is created by the operating system to supply various information of use to startup code (typically the bowels of the C library) and the dynamic linker (which, if one is used, also mutates the metadata to make it look as though the executable was loaded directly). (If we failed to get the link map using the previous _r_debug symbol method, we can also use the auxiliary vector to get to the executable's DT_DEBUG, and hence to the link map.) Here's a very nice article all about auxiliary vectors.

The auxiliary vector lives high on the initial thread's stack, a fixed offset above the stack pointer when the program's first instruction receives control. If we could walk the main thread's stack right to the top, we could find the auxiliary vector easily. However, nowadays, since we lack frame pointers, that means making a library call to something like libunwind, and that might allocate memory. Even then, it's not guaranteed that we can walk the stack all the way to the top, since unwind information might be patchy up there. So we'll need a different approach.

I devised a nasty but ultimately pleasing hack for this. When we say “auxiliary vector” we really mean specifically a list of key-value pairs containing the useful metadata I mentioned. But there's also a bunch of other things up there, in a large contiguous structure (see the helpful article for a diagram): the environment strings (a blob of characters), argument strings (ditto), and vectors of pointers into these blobs (our initial environ and argv). Instead of walking the stack, can we get hold of a pointer into one of these blobs, and then parse the surrounding memory to find the start of the key-value pairs? It's very likely that we can, since we can expect to find a handy global variable, environ, that points into them.

Now, it's important to note that this isn't completely robust. The auxiliary vector only records the initial environment. A program can modify the current environment strings to its heart's content by using C library calls like putenv() and setenv(). These will mutate and, if necessary, reallocate the vector pointed at by environ. However, unless a program deletes its entire initial environment, and assuming the C library doesn't eagerly copy things out of it, that vector should always contain one or more pointers into the initial environment blob. So, we want to walk environ and look for such a pointer.

The next problem: how will we recognise such a pointer when we see it? For this, we need to fix some more very-likely-true assumptions. Firstly, we assume either that the initial stack is the highest user mapping in the process, or that if something is higher, it's a shared object (perhaps Linux's vdso) which we can identify in the link map. (Put differently: what's not allowed is for there to be a memory-mapped file above the initial user stack, other than a shared object. This doesn't seem too restrictive; we could even imagine an ABI-level guarantee from the kernel that it will never create such mappings.) This assumption gives us an upper bound to compare our environment pointers against. What about a lower bound? For that, we assume that the caller can supply us one, in the form of a pointer higher than any other stack-allocated string in environ. The address of a local variable in main() could be one such pointer. In fact, any address on the initial stack will probably be good enough, since putting a stack-allocated string in your environment would be a very quirky thing to do. Anyway, we suppose that our caller, the client code which wants to get hold of the auxiliary vector, can give us such a pointer.

Now we're ready to go. We scan *environ for pointers to strings within these bounds. Then we navigate downwards in memory to find the end of the auxiliary vector, then keep going to find its beginning, after which we'll see the terminator of the argument or environment pointer vector.
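In the common case, the end result can be sketched very simply. The following is my own illustrative code, not relf.h's: it takes the shortcut of assuming environ still points at the initial on-stack vector, in which case the auxiliary vector begins immediately after that vector's NULL terminator. The pointer scan just described is what makes the real thing robust when environ has been reallocated.

```c
#include <link.h>    /* for ElfW() and the auxv_t type */
#include <stddef.h>

extern char **environ;

/* Simplified sketch (mine, not relf.h's actual code): assuming environ
 * still points at the initial vector on the stack, the auxiliary vector
 * lies immediately beyond that vector's NULL terminator. */
ElfW(auxv_t) *find_auxv(void)
{
	char **p = environ;
	while (*p) ++p;                  /* advance to envp's terminator */
	return (ElfW(auxv_t) *) (p + 1); /* auxv begins just beyond it */
}
```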

Once we have the auxiliary vector, we have the executable's program headers, from which it's easy to get at most other things we need, again using the link map to access other objects.

Program headers

One challenge remains for us: getting at shared objects' program headers. The problem here is that these headers needn't, in principle, be found anywhere in memory. (The ELF format allows a file to require that its program headers are mapped, using PT_PHDR, but it's not required and its use is uncommon.) Typically a dynamic linker will always keep each object's program headers in memory somewhere, but it's not obliged to do so. Even if it does, that memory might be anywhere. The GNU dynamic linker sometimes stores just a malloc'd copy of the data, perhaps originally read through a file descriptor rather than a memory mapping. Either way, the pointer it maintains to the header data lives in private state (in extra non-public fields in the link map structure); there's no portable access to it. So, to get shared objects' program headers there's no avoiding the use of some platform-specific (and loader-specific) code to fish it out.

The major limitation of doing without program headers is that we don't have a complete map of the address space. The link map lets us enumerate objects and their start addresses, and we can then use the symbol tables to do a “forward lookup” from symbols to addresses. We can invert the symbol tables to give us a partial backward lookup from addresses to symbols. But we can't do quite as complete a backward lookup as the dladdr() function can. If you give dladdr() an address that falls within an object's mappings but isn't covered by an exported symbol, it can still tell you which object it's in. Only the program headers contain the information necessary for this. Another thing we can't figure out is which memory access permissions a segment was mapped with—again, that means looking at the relevant program header.

Maximising ELF introspectability

Armed with these observations, we could imagine dynamically rewriting the binaries on our system slightly to optimise for ELF introspectability. Firstly we would insert a PT_PHDR, and define a symbol (maybe _PHDR, roughly analogous to _DYNAMIC) to help us find it. Secondly, going back to the restriction that only dynamic symbols were available for lookup, we could export all .symtab entries as “hidden” dynamic symbols. The obvious objection to this is that it will slow down linking. However, symbol lookup by name happens via the hash table (DT_HASH or DT_GNU_HASH). I haven't checked all the details yet, but it appears that adding them to the symbol table needn't necessarily mean including them in the hash table. There has to be a hash-chain “next pointer” per symbol, but it's permitted not to hook that entry into any bucket. Since link-time lookups occur via the hash table, leaving the hidden symbols out of the hash table seems to defeat the objection. They'd be present for a “deep” introspector to find, by scanning the symbol table. But they wouldn't interfere with the symbol lookup process, nor with code using dlsym().

Code

If you'd like to use this introspection goodness, I've coded it all up in a single header, relf.h, which is fairly self-contained and should be self-explanatory. Please let me know if you find it useful!

[/devel] permanent link contact

Wed, 19 Nov 2014

How to write vaguely acceptable makefiles

It's not hard to write a clear, orderly makefile if you respect it as a programming job in its own right, and adopt some principles. I rarely see such principles written down. Here's what I've gravitated towards as I've learnt to use make more effectively.

[/devel] permanent link contact

Tue, 01 Jul 2014

Drop-in debugging

If you're debugging a problem that occurs in some process deep in a process tree, getting a debugger onto that process can be hard. I recently wrote a script called gdbrun that can help with this. The idea is that for any command cmd args, you can do gdbrun cmd args and it will run the program under a debugger, while externally behaving just like the original program. The exception is if the program doesn't terminate properly: in that case the debugger sticks around for you to interact with. (If you need a more complex condition than “doesn't terminate properly”, you can code that into the script too.)

This isn't ideal—say, if you want to debug only one invocation of that binary among many. But it works in many cases. (I'm also working on a solution for the harder case, but one thing at a time.) The script is deceptively simple. Sadly it needs GNU bash, but for any non-GNU people it should be easy enough to port.

#!/bin/bash

exec 7<&0
exec 8>&1
exec 9>&2

quote () {
    sed "s/\([\`\\\\\\\"\\\$]\)/\\\\\\1/g"
}

exe="$1"
shift
argstring=""
ctr=0
while true; do
    if [[ -n "$1" ]]; then
        # don't use 'echo' to generate input for quote, otherwise "-e" will be disappear'd
        argstring="${argstring:+${argstring} }\"$( cat <<<"$1" | quote )\""
    fi
    ctr=$(( $ctr + 1 ))
    shift || break
done

exit_status_file=$(mktemp)

# can add -hold here to help debugging
xterm -e gdb \
    --eval-command "file $exe" \
    --eval-command "run $argstring </dev/fd/7 >/dev/fd/8 2>/dev/fd/9" \
    --eval-command "set logging file ${exit_status_file}" \
    --eval-command "set logging on" \
    --eval-command "print \$_exitcode" \
    --eval-command "quit \$_exitcode" </dev/null

inferior_status=$(cat ${exit_status_file} | sed 's/.* = //' )
exit ${inferior_status:-0}

What was hard? Firstly, it took some head-bashing to realise that sharing a terminal between the debugger and inferior wasn't going to work. That's why the script uses xterm—popping up terminal windows for an instant is a minor annoyance, but tolerable. I can't say for certain that a shared-terminal solution isn't possible, but the usual trick of stty -tostop wasn't enough to stop child processes hanging from contention for the terminal, so I gave up.

The second hard thing, and the hardest, was realising that I could thread duplicated file descriptors right down to the inferior, and open them with /dev/fd/7 and friends. This was non-obvious to poor old me. I first tried a solution with named pipes and cat processes forwarding data from/to the gdbrun script's streams to the inferior's streams. After much gnashing of teeth, this turned out to be unworkable. The reason is that pipes are asynchronous channels, meaning they're buffered. A consequence is that even if the inferior doesn't read from its input, some input data will be buffered in the pipe when it gets opened (actually it's opened by the shell that launches the inferior, using gdb's run <infile syntax), and then discarded when the inferior dies. So if you have a command like echo blah |(gdbrun true; cat) your blah will get gobbled into oblivion. The moral of the tale is that introducing cats and pipes is a recipe for mischief.

[/devel] permanent link contact

Sat, 23 Feb 2013

A curiously recurring explanation

In conversation recently(-ish), I tried to explain the infamous Curiously Recurring Template Pattern (CRTP) of C++ by relating it to mixins. That just turned one surprisingly tricky problem into two. Here I will try to rectify both problems by providing a somewhat coherent explanation of how to realise mixins using CRTP.

/* defining a mixin */
template <class Instantiator>
class my_mixin
{
	int some_state;
public:
	void invoke_feature() { /* using some_state, somehow */ }
};

/* "mixing in" a mixin */
class built_from_mixins
 : public 
      my_mixin<built_from_mixins> /* , other mixins ... */
{
    /* ... */
};

Unlike normal base classes, a mixin is designed to attach at an arbitrary point in the subtyping hierarchy. It's an orthogonal feature that can be “mixed in” to many potential classes, rather than an increment on a particular class. Therefore, rather than referencing a specific base class, a mixin leaves this reference unbound. Nevertheless, it still gives a name to the class it is extending. Here it is called Instantiator. This can be thought of as a “forward reference” to the class that is “mixing it in”. The name is a placeholder for the class that the programmer will extend using the mixin.

(There are other variations on this theme in other mixin-like constructs. Anyone interested in the many meanings of “mixin” could do worse than to start with Richard Gabriel's essay which is based around this subject— though I note that its actual point about incommensurability is deeper, completely distinct, and very interesting!)

Looking at the code, we see there is a cyclical reference chain: from the mixin user built_from_mixins to a specialisation of the mixin itself, my_mixin<built_from_mixins>, and back (via Instantiator, which is instantiated to built_from_mixins). These references are a bit strange. We haven't even defined built_from_mixins at the point where we use it to parameterise our mixin instance. Why does the compiler even allow this?

The answer is that of course it's allowed, and the compiler allows it simply by virtue of the usual C++ (and C) rules about incomplete types. Cyclic reference among data type definitions is not unique to mixins. Consider a linked list, where it's no problem to create a recursive “next” pointer inside the list node type. Since pointers-to-incompletes are allowed, and the list node type is just another incomplete type at that location in the code, the code compiles fine.

It takes only a small generalisation to apply this not just to incomplete pointer target types, but more generally to incomplete types used as template parameter instances. Of course we can refer to built_from_mixins inside its own inheritance clause—but we can only do things that we can do with an incomplete type. Using it as a template parameter is one such thing—so long as the template's definition is consistent with its parameter being incomplete. In particular, possible usages of Instantiator inside my_mixin, above, are limited if we want to use the incomplete built_from_mixins as our Instantiator: we can only do the things we can do with any other incomplete types inside a class definition. Happily, my_mixin's definition sticks to this regime, so is well-formed. Moreover, it itself is a complete data type! (Similarly, if you defined a mutually recursive pair of data types using pointers, whichever one of them you defined first in your source file would be complete immediately, even though it contains a pointer to the yet-to-be-defined second data type.) Being complete, our instantiation of my_mixin is fair game for deriving from. This is what allows us to derive built_from_mixins from my_mixin<built_from_mixins>: the latter is complete, even though its type parameter built_from_mixins (known as Instantiator inside the mixin) isn't.

In fact, we haven't used Instantiator at all inside my_mixin. So, why include it? What can we do with it? Well, we can safely use it in any way we can use any other incomplete type: as a pointer (or reference) target type, or as a type parameter. An example is boost's enable_shared_from_this, a mixin which adds the necessary state and member functions for allowing a class to provide a shared_ptr version of its this pointer. You can't safely create a shared_ptr from a regular pointer in general because you don't know where the target object's reference count lives. The enable_shared_from_this mixin fixes this by embedding a pointer to the refcount, in the guise of a weak_ptr subobject, into the mixing-in class. The guts of enable_shared_from_this are basically as follows.

template<class T> class enable_shared_from_this
{
private:
    mutable weak_ptr<T> weak_this_;
public:
    shared_ptr<T> shared_from_this()
    {
        shared_ptr<T> p( weak_this_ );
        return p;
    }
};

Just as in our first example, we have some private state and a public interface which implement an orthogonal feature that can be “mixed in” to any class. The mixin-instantiating class T is referenced only in an incompleteness-tolerant way, to instantiate other templates and (eventually, inside the definition of weak_ptr, which is not shown) to define a pointer target type.

I've also seen CRTP described as “virtual functions without polymorphic behaviour”. What does that mean? Since our mixin has a reference to its instantiating class, it can call its methods—even though the binding to that specific class has not yet been formed. In other words, we have deferred the binding of methods—but not until run time. Rather, we have deferred them to later in compile time, when our templates are elaborated. Let's look at an example.

Let's try to get this deferred binding without using CRTP, but also without using virtual functions. Unsurprisingly, it doesn't work. The best we can do is to try non-polymorphic inheritance, by writing something like this.

class X
{
public:
	void f() { /* unimplemented */ }
	void g() { f(); }
};

class Y : public X
{
public:
	void f() { cout << "Done something"; }
};

Of course, if we call X::g(), it calls X::f() and not Y::f(). Using CRTP, we can get it to call Y::f() without resorting to virtual functions.

template <class Derived>
class X
{
public:
	void f() { /* unimplemented */ }
	void g() { Derived::f(); }
};
class Y : public X<Y>
{
public:
	void f() { cout << "Done something"; }
};

CRTP allows the base class to leave certain functions undefined, for definition later in many possible derived classes. The derived classes are not derived the usual way, though: they are derived using CRTP, passing the derived class back to the base class as a type parameter.

This sets up a now-familiar kind of cyclic reference: the base class refers to its (potentially many) derived classes, through its template parameter. Having this “forward” reference, down the inheritance hierarchy, as well as the usual backward reference up it, is what allows derivation-time binding. It's also limiting: we can't subsequently “re-override” Y::f(). Y's method bindings are fixed at derivation time. We have to create a new specialization of X and derive immediately from that, using some other means to get at Y's functionality if we need it.

Interestingly, note that it's okay for us to do Derived::f() in our X class template. This is surprising because at the point where we instantiate X, Derived is an incomplete type. I mentioned earlier that we were restricted in what we could do with incomplete template parameters, but in this case, here we are happily calling a method of ours. At definition time, there are at least two possibilities for the code that must be generated at the site of our call to Derived::f(), because f() might be an instance member function or a static. (It could also be a function object overloading operator().) When we instantiate X, if we read the code strictly top-to-bottom, we haven't yet got to the body of Y, so it is not yet decided whether it will define f() as a static or an instance member function. Somehow, the compiler is looking ahead at the definition of Y at the point where it elaborates X<Y>, even though Y cannot be complete yet (because we're in the middle of elaborating its base class). I must confess, this is where my understanding of the C++ language runs out. The standard contains complicated rules about name lookup in templates—principally the infamous “two-phase name lookup”. In short, “unqualified names” in templates are looked up at template definition time, whereas “qualified names” are looked up at instantiation time. Clearly our use of Derived::f() is a qualified name. No doubt there is some part of the standard detailing how the second-phase lookup is permitted (and required) to look into an incomplete type, namely Y in our example (incomplete at the time of X's instantiation), to understand the meaning of Y::f(). I haven't yet found a reference to it though. If anyone can point me to it, I'd be much obliged.

[/devel] permanent link contact

Wed, 27 Jun 2012

32 bits should be enough for anyone

For a brief while, 32-bit Intel machines were the de facto standard in commodity hardware, and life was simple. Certainly, it's an ugly architecture, gigantically overcomplicated by backwards-compatibility. Its virtual addressing features are terrifying. But the subset of it which user-level programmers on modern OSes use is fairly comprehensible. There is an ISA-defined stack with its own registers and a well-defined calling convention. Pointers and most integers are 32 bits, meaning that the “word” is a useful and well-understood unit of storage.

All this changed in the early 2000s as AMD's 64-bit extension of the ISA came into popularity. Upon us were forced bigger integers, bigger pointers, and a more complicated stack and calling convention (in the name of “performance”, but at huge cost in complexity). I believe these were completely retrograde steps. Now that pointers are 64 bits, our software's memory footprint and disk footprint are bloated considerably. To “alleviate” this, and to avoid certain paradoxes about the size relationships between short, int and long, an int in most C compilers stayed at 32 bits. Unfortunately, this is completely braindead, because int is very much supposed to be a word-sized type. This is the reason that C's “defaults to int” semantics, as applying to unprototyped functions and untyped variables, are sane.

Does this matter? Yes! Here is some code that was mysteriously segfaulting for me this morning. It's from DTrace, or more specifically, Paul Fox's Linux port of it.

if ((P->rap = rd_new(P, P->pid)) != NULL)
  (void) rd_loadobj_iter(P->rap, map_iter, P);

Somehow, the pointer returned by rd_new()---which just wraps a simple calloc() call---gets corrupted immediately after returning. Suspiciously, said corruption is that the top four bytes are 0xffffffff, whereas the lower four bytes are those of the pointer returned by calloc(). Inspecting the disassembly around the rd_new call reveals the culprit.

   0x000000000047ed12 <+150>:   callq  0x462bc6 <rd_new>
=> 0x000000000047ed17 <+155>:   cltq   

What's this cltq thing? Well, it takes the lower 32 bits of %rax (the 64-bit register holding the return value from rd_new()) and sign-extends them to fill the full 64 bits. This is exactly the corruption I was seeing. Why did the compiler insert this unwanted instruction? The answer is revealed if we recompile the file with -Wall.

Psymtab.c:645:3: warning: implicit declaration of function `rd_new' [-Wimplicit-function-declaration]

The function is not prototyped, so its return value defaults to int. But because int is now 32 bits wide, and the register holding the return value is 64 bits wide, the compiler helpfully obliterates the top 32 bits of the return value for us by sign-extending the lower 32 bits into them. If the compiler implementors had stuck with the intention of the int data type, that it be exactly a word in size, and therefore that “defaults to int” is sensible, this would not have arisen.

Now, this is clearly sloppy code. We should just fix it so that rd_new() is prototyped. It probably seems a bit of a non sequitur that I am blaming this problem on 64-bit architectures. But on the other hand, how often in your code have you actually wanted integers that can store values in excess of 2^32? If you are a systems programmer, you might value the ability to encode large offsets. But we already had long long for this. In other cases, the vast bulk of our code deals with small integers, characters and pointers. Giving us an extra 32 bits of width in our ALU operations is a waste of transistors.

Why did we waste them this way? Well, we had to waste them somehow. In the early 2000s, we didn't really know what else to do with them, because (I suspect) there was little perceived demand for multiple cores in the commodity market (outside of servers). Nowadays, we have even more transistors, and even hardware guys realise that giving us 128-bit architectures would be a pointless waste. So, they spent some effort convincing us that we really did want multiple cores after all. And now we are busy complicating our software so that we can “exploit” these too. I have ranted before about how that risks generating a generation's worth of bad software. Perhaps I should say “another generation's worth”.

By the way, I'll concede that 64-bit address spaces can be useful, if they are used to support persistence or sharing. No need for pointer swizzling! But AMD's 64-bit x86 extensions do not provide the separation of protection from mapping to realise the sharing use-case. In other words, switching protection domains still means invalidating the TLB entries of shared mappings. Meanwhile, I haven't seen anyone using the extra address space for accessing persistent storage in a radically new way, although I'd love to see some approaches in this space.

I don't completely doubt the value of multiple cores either. The right way to see parallelism is as an enabler for radically more computation-intensive applications---likely to be in domains such as scientific computation, machine learning, and simulation---than what we can currently support. As I have also ranted about before, I am deeply disturbed by the fervour for mass rewriting of everyday software, and the disruption to the infrastructure it runs on, that is resulting from mindless multicore mania, in the same way that the 64-bit shift has disrupted our infrastructure. It's all in the name of performance, but it costs us far more of human beings' time and energy than it saves.

[/devel] permanent link contact

Fri, 01 Jun 2012

Link order

Initialization order of static state is a thorny problem. It's particularly tricky to get right portably. But until recently I didn't realise how tricky it could be even when restricting oneself to GNU tools on Unix platforms. Consider the following three-part program, consisting of an executable prog and two shared libraries lib1 and lib2. The dependency order is left-to-right in that list: prog depends on lib1 which depends on lib2.

/* prog.c */
#include <stdio.h>

/* from lib1 */
void greeting(void);

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
{
	fprintf(stderr, "Initializing prog\n");
}

int main(void)
{
	greeting();
	return 0;
}

/* end prog.c */

/* lib1.c */
#include <stdio.h>

/* from lib2 */
void hello(void);

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
{
	fprintf(stderr, "Initializing lib1\n");
}

void greeting(void)
{
	hello();
}

/* end lib1.c */

/* lib2.c */
#include <stdio.h>

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
{
	fprintf(stderr, "Initializing lib2\n");
}

void hello(void)
{
	printf("Hello, world!\n");
}

/* end lib2.c */

Here's a GNU Makefile to tie it all together.

CFLAGS := -g -fPIC
LDFLAGS := -L$$(pwd) -Wl,-R$$(pwd)
LDLIBS := -l1 -l2

default: lib1.so lib2.so prog

%.so: %.c
	$(CC) $(CFLAGS) -shared -o "$@" "$<"

clean:
	rm -f lib1.so lib2.so prog

Now when you do make (or gmake) it will build a program that initializes its libraries in right-to-left order: from the “most depended on” to the “least depended on”. We can verify this by running the program.

$ ./prog
Initializing lib2
Initializing lib1
Initializing prog
Hello, world!

Moreover, if you try flipping around the link order in the LDLIBS line, the link will fail with undefined reference to `hello', because the reference to hello (in lib1) is introduced after the reference to lib2, and the linker's defined behaviour is to avoid re-scanning for new undefined references---it's up to the invoker to order the libraries so that this works.

Let's try this on a BSD system. I have a NetBSD 5.0 VM hanging around, so I'll try that. It has recent GNU make, GCC and GNU binutils installed.

$ gmake
cc -g -fPIC -shared -o "lib1.so" "lib1.c"
cc -g -fPIC -shared -o "lib2.so" "lib2.c"
cc -g -fPIC  -L$(pwd) -Wl,-R$(pwd)  prog.c  -l1 -l2 -o prog
$ ./prog
Initializing lib1
Initializing lib2
Initializing prog
Hello, world!

Strangely, our initialization order is flipped. This doesn't matter for our program, but if lib1 consumed some static state in lib2, it would matter quite a bit. What happens if we flip the link order around to compensate? We edit the LDLIBS line and re-make.

$ nano Makefile
$ gmake clean && gmake
rm -f lib1.so lib2.so prog
cc -g -fPIC -shared -o "lib1.so" "lib1.c"
cc -g -fPIC -shared -o "lib2.so" "lib2.c"
cc -g -fPIC  -L$(pwd) -Wl,-R$(pwd)  prog.c  -l2 -l1 -o prog
$ ./prog
Initializing lib2
Initializing lib1
Initializing prog
Hello, world!

This has done what we want. But what's going on? This link order didn't even work on GNU/Linux. Not only does it work on BSD, but it's required if we want a sensible initialization order. Our initializers run in left-to-right order, so we need to put the “most depended on” libraries first, not last. This isn't a BSD quirk per se, because we're using the GNU linker in both cases. I suspect the linker scripts are nevertheless different in the two cases. However, I haven't had time to look into the details of why. I'd be interested to hear, if anyone knows. I guess this is the sort of peculiarity that gives libtool a reason to exist.

[/devel] permanent link contact

Tue, 13 Dec 2011

Load addresses

For reasons I will only hint at, I want to predict the load address of a set of shared libraries, given an executable that links against them. Naturally, I have turned off address space layout randomization.

At first, I thought I could use ldd for this. It seems to work.

$ ldd /usr/local/src/git-1.7.5.4/git
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

But also, there is an environment variable called LD_TRACE_LOADED_OBJECTS that is supposed to have the same effect. As it happens, ldd is just a small wrapper script which sets this variable and invokes the dynamic linker, which on my system is /lib64/ld-linux-x86-64.so.2. Let's try doing this directly.

$ LD_TRACE_LOADED_OBJECTS=1 /usr/local/src/git-1.7.5.4/git
        linux-vdso.so.1 =>  (0x00007ffff7ffb000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7bc4000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff79a7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7608000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7ddc000)

That seems to work too. But wait! It's given us different load addresses than ldd did. Have I really turned off randomization? Well, yes. In fact, repeating either of these commands will reliably yield the output above, and they are reliably different from one another. What is going on?

Let's hack ldd so that it prints exactly what command it is going to execute.

$ ldd /usr/local/src/git-1.7.5.4/git
About to eval:  LD_TRACE_LOADED_OBJECTS=1 LD_WARN= LD_BIND_NOW= LD_LIBRARY_VERSION= LD_VERBOSE= /lib64/ld-linux-x86-64.so.2 /usr/local/src/git-1.7.5.4/git
verify_out is 
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

So, it has set a bunch of other environment variables to empty strings. They look innocuous enough. But also, it is invoking the loader directly, whereas we were just letting execve call the loader for us. Can we reproduce the result of ldd by running the same command it does?

$ LD_TRACE_LOADED_OBJECTS=1 LD_WARN= LD_BIND_NOW= LD_LIBRARY_VERSION= LD_VERBOSE= /lib64/ld-linux-x86-64.so.2 /usr/local/src/git-1.7.5.4/git
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

Yes, we can. Now, the big question: which one is correct? Let's run our program under gdb and inspect the memory map.

$ gdb --args /usr/local/src/git-1.7.5.4/git
... snipped ...
(gdb) break main
Breakpoint 1 at 0x404bf0: file git.c, line 509.
(gdb) run
Starting program: /usr/local/src/git-1.7.5.4/git 
[Thread debugging using libthread_db enabled]

Breakpoint 1, main (argc=1, argv=0x7fffffffddd8) at git.c:509
509     {
(gdb) print getpid()
$1 = 27023
(gdb) shell cat /proc/27023/maps | grep 'lib.*\.so.*'
7ffff7608000-7ffff779d000 r-xp 00000000 08:07 356590                     /lib/x86_64-linux-gnu/libc-2.13.so
... snipped ...

So, libc-2.13.so has been loaded at address 0x7ffff7608000, which is what we got from running with just the LD_TRACE_LOADED_OBJECTS flag set, and not what we got with ldd or with invoking ld-linux.so.2 specially.

Why the difference? Clearly, first execveing the loader perturbs the address assignment. It's not clear why this should be---isn't the loader itself the first thing to be loaded anyway? I'm not yet sure what is going on.

Another question: is predicting the load address even a sound thing to do? Given that we had to disable randomization in the first place, it seems like a bad idea. In my case, this approach will do for now, but ultimately I should defer my work until application start-up time. Then we can discover the actual loaded addresses of the various libraries, which is much more robust.

[/devel] permanent link contact

Thu, 01 Dec 2011

Weak dynamic symbols

Although I know more than the average bear about linkers, there are always things I don't know. Until now I never had cause to understand the following: how does the linker know which symbols to link dynamically, and which to link statically?

A bit of head-scratching reveals the only possible answer. Given a command line, the linker does the usual thing: it looks at the command-line options and performs the standard library look-up procedure, gathering a list of object files---some static, some dynamic. If a symbol is defined by a dynamic library, it makes that symbol dynamic. Otherwise, it stays static.

That sounds fairly sensible. But it can mean surprises. Suppose you have a C program that wants to support an optional feature to be linked in at load time. You might write something like the following.
int optional_function(int arg1, void *arg2) __attribute__((weak));

static int some_obj; /* whatever we pass to the optional code */

/* ... */

void do_something(void)
{
    if (optional_function) optional_function(42, &some_obj);
    /* else skip the optional part... */
}

If you pull this sort of trick within a shared library, it works fine. But inside an executable: no dice! If you compile this into an executable and look for optional_function in your dynamic symbol table, you'll be disappointed.

$ objdump -T my-program | grep optional_function
$

What is going on? Well, it's in the static symbol table, silly.

$ objdump -t my-program | grep optional_function
0000000000000000  w      *UND*  0000000000000000          optional_function
$

What does it mean to have an undefined symbol in your executable's static symbol table? It means it will silently take the value zero! In fact, the relocation records referencing your symbol have already been discarded.

$ objdump -rRd my-program | grep -A1 -B1 callq
(snip)
--
  400549:      bf 2a 00 00 00        mov    $0x2a,%edi
  40054e:      e8 ad fa bf ff        callq  0 <__init_array_end>
  400553:      b8 00 00 00 00        mov    $0x0,%eax
--
(snip)

Cheerily, the linker has inserted a direct-bound call to address zero in your code. That's not what we want! So, how can we fix it?

The trick is in the linker's (or at least the GNU linker's) --dynamic-list option. First, create a file called whatever you like (mine's called dynamic-list), containing the following.

{ optional_function; };

Now link your program passing --dynamic-list <your-dynamic-list> to the linker.

gcc -Wl,--dynamic-list -Wl,<your-dynamic-list> -o my-program my-program.c

Hey presto! You should now have your weak symbol in the dynamic symbol table.

$ objdump -T my-program | grep optional_function
0000000000000000  w   D  *UND*  0000000000000000          optional_function
$

That's a bit ugly. Recalling the linker behaviour I described at the beginning, the simpler way to do it is just to link your executable against a shared library defining optional_function.

You might wonder (as I do): what is the point of putting undefined symbols in an executable's static symbol table? Once the executable is output, it's too late to link anything with them. Surely they should all be “promoted” to dynamic symbols? [Update, 2012-5-19: there is of course a linker option for doing this, which in the GNU case is --export-dynamic. Still, I'm not sure why it isn't the default.]

It would also be nice to have an objcopy option for adding dynamic symbols in this way, so we can do it after the fact, rather than changing the linker command like we did above. However, this is nontrivial for the reason I mentioned---the relocation records that you would want have already been eliminated. So, we would need to re-create them. This is similar to something I began work on before. At some point I might resurrect my objcopy patches and try to repurpose them to this problem. For now, I will just hack in the extra linker options.

[/devel] permanent link contact

Thu, 06 Oct 2011

LLVM structural typing

I'm learning about the LLVM compiler infrastructure at the moment.

LLVM bitcode includes a notion of data types. These are used to control implicitly the size and encoding of values generated by various operations, to hint about mappings to underlying machine data types (e.g. on architectures that distinguish floating-point from integer registers) and to implicitly cause certain transformations to be effected, such as padding or sign extension. (I'm not yet sure whether all such operations need to be explicitly rendered as an LLVM “bitcast” operation or not. At least, LLVM's notion of types can be used to define validity of these operations, whether or not they happen implicitly.)

Moreover, addresses (pointers) are typed according to the type of the values they reference (point to). The data types are in this sense higher-order. (This is a weaker case of “higher-order” than types specifying the behaviour of functions. But it has some things in common. I will blog about this more in the future.) These data types control implicitly how much data is read or written by indirect loads and stores.

A typical C front-end will encode C data types directly into this type system. However, this is just a convenience. The encoding discards some of the semantics of the C type system, because in LLVM, composite types are treated purely structurally, whereas in C, they are always treated nominally. Consider this program.

#include <stdlib.h>

struct Foo {
  int a;
  int b;
};

struct Bar {
  int x;
  int y;
};

int main(void)
{
  struct Foo *f = (struct Foo *) malloc(sizeof (struct Foo));
  struct Bar *b = (struct Bar *) malloc(sizeof (struct Bar));

  free(f);
  free(b);

  return 0;
}

In LLVM bitcode, using llvm-gcc, we get the following.

; ModuleID = 'test.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%struct.Bar = type { i32, i32 }
%struct.Foo = type { i32, i32 }

define i32 @main() nounwind {
entry:
  %retval = alloca i32
  %0 = alloca i32
  %f = alloca %struct.Bar*
  %b = alloca %struct.Bar*
  %"alloca point" = bitcast i32 0 to i32
  %1 = call noalias i8* @malloc(i64 8) nounwind
  %2 = bitcast i8* %1 to %struct.Bar*
  store %struct.Bar* %2, %struct.Bar** %f, align 8
  %3 = call noalias i8* @malloc(i64 8) nounwind
  %4 = bitcast i8* %3 to %struct.Bar*
  store %struct.Bar* %4, %struct.Bar** %b, align 8
  %5 = load %struct.Bar** %f, align 8
  %6 = bitcast %struct.Bar* %5 to i8*
  call void @free(i8* %6) nounwind
  %7 = load %struct.Bar** %b, align 8
  %8 = bitcast %struct.Bar* %7 to i8*
  call void @free(i8* %8) nounwind
  store i32 0, i32* %0, align 4
  %9 = load i32* %0, align 4
  store i32 %9, i32* %retval, align 4
  br label %return

return:                                           ; preds = %entry
  %retval1 = load i32* %retval
  ret i32 %retval1
}

declare noalias i8* @malloc(i64) nounwind

declare void @free(i8*) nounwind

Notice that although the compiler has emitted two LLVM type definitions, one for each of our struct types, it then proceeds to use only the first one of them. The second is redundant, because the two are structurally equivalent. This starts to look even more peculiar when we make our data types recursive.

#include <stdlib.h>

struct Foo {
  int a;
  int b;
};

struct Bar {
  int x;
  int y;
};

struct FooRecursive {
  int a;
  struct FooRecursive *next;
};

struct BarRecursive {
  int a;
  struct BarRecursive *next;
};

int main(void)
{
  struct Foo *f = (struct Foo *) malloc(sizeof (struct Foo));
  struct Bar *b = (struct Bar *) malloc(sizeof (struct Bar));

  struct FooRecursive *fr = (struct FooRecursive *) malloc(sizeof (struct FooRecursive));
  struct BarRecursive *br = (struct BarRecursive *) malloc(sizeof (struct BarRecursive));
  
  free(f);
  free(b);
  free(fr);
  free(br);
  
  return 0;
}

This gives us the following.

; ModuleID = 'test.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%struct.Bar = type { i32, i32 }
%struct.BarRecursive = type { i32, %struct.BarRecursive* }
%struct.Foo = type { i32, i32 }
%struct.FooRecursive = type { i32, %struct.BarRecursive* }

define i32 @main() nounwind {
entry:
  %retval = alloca i32
  %0 = alloca i32
  %f = alloca %struct.Bar*
  %b = alloca %struct.Bar*
  %fr = alloca %struct.BarRecursive*
  %br = alloca %struct.BarRecursive*
  %"alloca point" = bitcast i32 0 to i32
  %1 = call noalias i8* @malloc(i64 8) nounwind
  %2 = bitcast i8* %1 to %struct.Bar*
  store %struct.Bar* %2, %struct.Bar** %f, align 8
  %3 = call noalias i8* @malloc(i64 8) nounwind
  %4 = bitcast i8* %3 to %struct.Bar*
  store %struct.Bar* %4, %struct.Bar** %b, align 8
  %5 = call noalias i8* @malloc(i64 16) nounwind
  %6 = bitcast i8* %5 to %struct.BarRecursive*
  store %struct.BarRecursive* %6, %struct.BarRecursive** %fr, align 8
  %7 = call noalias i8* @malloc(i64 16) nounwind
  %8 = bitcast i8* %7 to %struct.BarRecursive*
  store %struct.BarRecursive* %8, %struct.BarRecursive** %br, align 8
  %9 = load %struct.Bar** %f, align 8
  %10 = bitcast %struct.Bar* %9 to i8*
  call void @free(i8* %10) nounwind
  %11 = load %struct.Bar** %b, align 8
  %12 = bitcast %struct.Bar* %11 to i8*
  call void @free(i8* %12) nounwind
  %13 = load %struct.BarRecursive** %fr, align 8
  %14 = bitcast %struct.BarRecursive* %13 to i8*
  call void @free(i8* %14) nounwind
  %15 = load %struct.BarRecursive** %br, align 8
  %16 = bitcast %struct.BarRecursive* %15 to i8*
  call void @free(i8* %16) nounwind
  store i32 0, i32* %0, align 4
  %17 = load i32* %0, align 4
  store i32 %17, i32* %retval, align 4
  br label %return

return:                                           ; preds = %entry
  %retval1 = load i32* %retval
  ret i32 %retval1
}

declare noalias i8* @malloc(i64) nounwind

declare void @free(i8*) nounwind

Notice that the self-referencing structure of FooRecursive has been lost, again because a different type is structurally equivalent.

Now for a final experiment: what about singleton structs? Are they structurally equivalent to a single element? I'll throw in a typedef too, to see whether that appears.

#include <stdlib.h>

struct Foo {
  int a;
};
typedef int Baz;

int main(void)
{
  struct Foo *f = (struct Foo *) malloc(sizeof (struct Foo));
         Baz *b =        (Baz *) malloc(sizeof        (Baz));
  
  free(f);
  free(b);
  
  return 0;
}

Here's the code it generates.

; ModuleID = 'test.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%struct.Foo = type { i32 }

define i32 @main() nounwind {
entry:
  %retval = alloca i32
  %0 = alloca i32
  %f = alloca %struct.Foo*
  %b = alloca i32*
  %"alloca point" = bitcast i32 0 to i32
  %1 = call noalias i8* @malloc(i64 4) nounwind
  %2 = bitcast i8* %1 to %struct.Foo*
  store %struct.Foo* %2, %struct.Foo** %f, align 8
  %3 = call noalias i8* @malloc(i64 4) nounwind
  %4 = bitcast i8* %3 to i32*
  store i32* %4, i32** %b, align 8
  %5 = load %struct.Foo** %f, align 8
  %6 = bitcast %struct.Foo* %5 to i8*
  call void @free(i8* %6) nounwind
  %7 = load i32** %b, align 8
  %8 = bitcast i32* %7 to i8*
  call void @free(i8* %8) nounwind
  store i32 0, i32* %0, align 4
  %9 = load i32* %0, align 4
  store i32 %9, i32* %retval, align 4
  br label %return

return:                                           ; preds = %entry
  %retval1 = load i32* %retval
  ret i32 %retval1
}

declare noalias i8* @malloc(i64) nounwind

declare void @free(i8*) nounwind

Predictably, the typedef has gone away entirely, because it introduces no new structure. However, our singleton struct has stayed around. This isn't surprising either, because LLVM has instructions for accessing field members, whose semantics are affected by these structural differences. Composite types are not just sugar for arrays of bytes or words.

This does mean that if we wanted to encode nominal types into our LLVM bitcode, we could do it by wrapping nominally distinct types in differing depths of layered singleton structs. This would affect the bitcode that came out, e.g. inserting extra GetElementPtr operations, but shouldn't affect the optimised compiler output.

Overall, we can say that LLVM's data types are an annotation useful primarily for propagating and/or inferring data size and encoding information contextually through load and store operations. They are also used for checking: the bitcode is checked for first-order type errors. Since these errors are raised at the point of pointer use (e.g. a pointer assignment without an appropriate bitcast is an error), they can catch likely first-order errors early, i.e. when a pointer is generated or stored, rather than later when it is dereferenced and its target data is misinterpreted. By “first-order type errors” I roughly mean those at the machine level, meaning they always concern the misinterpretation of bit patterns encoding primitive values (integers, addresses, floating-point numbers). Since nominally distinct data types are conflated when they are structurally equivalent, without the encoding trick I mentioned the bitcode will not capture, say, that one variable encodes polar coordinates by x and r, and another by r and theta. Detecting violation of these abstractions is beyond the reach of any analysis based (only) on LLVM data types using the standard encoding.

This includes dynamic analyses. Right now I'm writing an is_a function for KLEE. KLEE is (effectively) a fancy LLVM interpreter. Without recourse to the source code and/or hacking the C front-end, we can only do structural is_a, which is slightly disappointing. I should add that I'm not criticising LLVM here at all. Intermediate representations are not the place for fancy type systems, and the structural approach works nicely. It just means more work for me, when it looked for a moment as though I could abuse pre-existing functionality.

[/devel] permanent link contact

Wed, 01 Jun 2011

Memtable again

I've finally got a nicely packaged implementation of memtables, the data structure I introduced in a previous blog post. It's in a single header---memtable.h. I've also fixed a couple of stupid bugs that crept into malloc_hooks.c just before I released the initial version. You can see an example of combining these two in heap_index_hooks.c---which you can compile into a simple LD_PRELOADable shared library that will instrument the glibc malloc to keep an indexed table of allocated chunks, keyed on address. It's pretty easy to search the memtable to find the heap chunk for a given pointer anywhere into an object. I'll integrate this into my Cake runtime implementation soon, and the whole lot will appear here in due course (i.e. eventually).

If you use any of these files, please drop me a line to say how you got on---I'd really appreciate it.

[/devel] permanent link contact

Thu, 19 May 2011

Namespace problems

It's always been a theoretical problem with C that there is no namespacing. I'd often wondered how much of a practical problem this really was, with “nothing major” my tentative answer. I've finally run into my first bona-fide gotcha arising out of this problem. In short: wxWidgets and GLib both define a GSocket.

Annoying as this is, it wouldn't be in my top ten grumbles about the experience of programming in C. A far bigger related problem is versioning. This doesn't get cited as a weakness of C because it's also a weakness of most other programming languages. The very reason I ran into the namespace problem was because I had to compile wxWidgets 2.6, rather than using the 2.8 revision that's packaged for my distribution. Version mismatches can be seen as namespace collisions too. Instead of getting the version you want, the namespace has been populated with slightly different stuff that is, despite its close relationship to what you actually require, still incompatible, much the same as if the namespace were polluted with random third-party stuff.

Versioning issues could perhaps be brought more under the programmer's control. Most programming languages don't have an explicit notion of “version” when importing stuff. But when explicitly consuming some target API, you are always assuming at least something about its version. Having the programmer declare which version of a set of declarations they want to import would be straightforward. In C, it could even be done quite neatly with just the preprocessor---say, writing #define __LIBFOO_REQUESTED_VERSION 4.2 before the relevant #include.

Of course, pessimistically refusing to link across nominal mismatches of version would be a bad solution. We want a more structural and, indeed, behavioural or “semantic” approach. With the C preprocessor approach I outlined, it becomes the header file author's responsibility to embed a test about which prior API version the associated implementation is compatible with, most likely using a simple #if test. This responsibility is not unreasonable I'd say---the developers are in the best place to say what has changed with a new revision. And since it's in a header file, if the maintainers are lazy, the client programmer can override it.

One shortcoming of this approach is that the client programmer might be too lazy to work out the earliest library version their code will work with, and will instead select whatever version they are developing with. This is safe, but prevents some valid compositions. On a different system with a slightly older version of the library, the header might conservatively conclude that it's not compatible with the client, even though it could work. Anyway, I don't worry about this too much. Lots of researchers have thought about versioning before, so there are probably some good solutions knocking around.

Back to the sockets example, it's perhaps unsurprising that the name collision occurred when linking two chunks of infrastructure code. Name collisions are most likely when abstracting the same domain, having the same natural language vocabulary---namely sockets in this case. This is much more likely to happen in infrastructure software (i.e. modelling system resources) than application level software (modelling circles or ellipses or airline reservations or health records and so on), simply because you're less likely to link multiple instances of the latter together. Whereas application-level code is at or near the top of the software dependency graph, the infrastructure stuff is lower down so more likely to get sucked into a program through dependency.

I was interested to note Nick Nethercote's recent blog entry about a problem with duplication (generally) and bloat (specifically) associated with multiple wrapper layers for system calls and other nonportable interfaces. He was talking about mmap(), but the socket abstraction is another example. I have some research proto-ideas that might help with this problem. Essentially I'm interested in recovering a finer-grained style of interface description from code, based on the idea of “relational interfaces”. You could then use this description to infer that two sets of functions had very similar behaviour, and factor out the duplication (with appropriate refactoring or adaptation tools).

This whole problem is another consequence of our fragile, direct-interfacing, in-order methods for constructing software. If we had a more flexible way of constructing software, the problem wouldn't arise. Rather than slavishly building on predefined interfaces that are specific to one underlying component---like one mmap() abstraction layer, or one socket abstraction---we need smarter tools for specifying our requirements abstractly and finding customised ways of satisfying them using a range of “found” code. This is what my Onward! '09 proto-paper was ranting about. I guess it's good that I'm still ranting. Interface hiding is as good an idea as ever, and more work on it will happen, when I get time....

[/devel] permanent link contact

Memtables

At the MMNet workshop in Glasgow last week, I talked about memtables. These are an efficient associative data structure, built using virtual memory support on modern OSes (currently implemented for Linux only), that are useful whenever you want to key a table on addresses in memory. See my slides for more.

Since entries with numerically similar keys are stored close to each other, memtables are, like certain other associative data structures, amenable to searching within a key range as well as exact-match lookups. By contrast, hash tables can't do this. (That said, a hash table supporting duplicate keys can be used to store items grouped into small equivalence classes. This is sometimes good enough, and could be made to work in my case. Nonuniform key duplication will mess up the O(1) nature of hash tables though.)

Memtables seem like they could be useful in lots of places. I invented them for DwarfPython as a fast way of storing and retrieving metadata given a key that may be an interior pointer (hence the searching requirement). I'm also (soon) using them in Cake as a fast way of tracking what objects are associated with what other objects.

The key space doesn't have to be addresses. It's possible we could even use memtables for free chunk binning, since large sizes are sparsely used. I need to do some more experiments to establish this.

The implementation comes in two parts:

  • A generic set of malloc hooks for glibc: these hooks are “generic” in that they're designed to be easily specialised for various conceivable kinds of instrumentation. They're not generic with respect to the allocator---sadly they're specific to glibc, but most mature allocators should have some similar mechanism. The key usefulness in these hooks is factoring the various cases of the malloc API---specifically the complex behaviour of realloc, but also other annoyances including memalign and null frees---into an easy-to-use set of higher-level hooks. These are likely (but not guaranteed) to be a better match for whatever your instrumentation is doing. For example, defining a post_successful_alloc() function will hook all events that allocate a new heap block, whether they originated in a malloc(), a realloc() or a memalign().
  • A generic memtable library: this will appear soon! It's a set of hook definitions that maintain a memtable, and a lookup function.

Memtables are strictly faster than a hash table, at least for lookups, because they are basically a hash table without the hashing. At least for most applications of memtables, the table itself acts as an index for a bunch of linked lists---call them bins or buckets. Rather than mapping the key space onto a smaller hash space in order to keep the index small, we index directly by the key, and rely on the virtual memory trick to keep the index small. Since we can only save page-sized chunks of space, the key space really needs to be used in a clustered and/or very sparse fashion. Just one used key per page of index is enough to allocate the whole table in physical memory, which we don't want. So if your table has four-byte entries, say, uniform key usage should be a lot less than one per thousand possible keys---but clusters of many are okay, so long as they're thousands apart.

[/devel] permanent link contact

Tue, 22 Mar 2011

How much memory could an mmap() map...

...if an mmap() could map memory? I've had to ask this question of Linux on 64-bit x86 platforms recently.

For reasons I will only hint at, I want to allocate a huge region of virtual address space for a structure a bit like a linear page table (called a “virtualized page table” on Wikipedia). We rely on a particular behaviour of Linux's mmap(): that mapping some memory isn't the same as committing to the availability of any underlying physical memory. Passing the MAP_NORESERVE flag means that memory will only be allocated when written to, hence allowing us to create a sparse table easily.

I decided my table should have one word per 4KB of memory. For a 32-bit machine, which has 4-byte words and 2^20 such (aligned) regions, this means I need 4MB of virtual address space for my table (i.e. about a thousandth of the VAS). If we ask mmap() for such a region, it will clearly oblige us. On a 64-bit machine, which has 8-byte words and 2^52 such regions, I need 2^55 bytes of virtual address space for my table---32 petabytes, or about eight billion times as much as in the 32-bit case, but again, only a small fraction of the total address space (in this case a 512th, because words are twice as big).

Here's a quick program you can run to test whether you can do an mmap() of a given size.

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <assert.h>

int main(int argc, char **argv)
{
    assert(argc > 1);
    size_t mapping_size = strtol(argv[1], NULL, 0);
    assert(errno != ERANGE);
    assert(mapping_size > 0);
    assert(sizeof(size_t) == 8);

    void *ret = mmap(NULL, mapping_size, PROT_READ|PROT_WRITE,
        MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);

    if (ret == MAP_FAILED)
    {
        fprintf(stderr, "error: %s\n", strerror(errno));
        return 1;
    }
    else
    {
        fprintf(stderr, "success!\n");
        return 0;
    }
}
    

And here's a shell script to drive it with powers of two until it fails.

#!/bin/bash

for exp in `seq 10 50`; do
    ./test $(( 2 ** $exp )) || break;
    echo "Succeeded for 2 ^ $exp"
done
    

I'd be curious to know whether anyone on an x86-64 Linux machine maxes out anywhere different than 2^46 bytes. The kernel source will have the answer, but I can't be bothered wading through it right now. Interestingly, turning off the overcommit check (i.e. writing "1" to /proc/sys/vm/overcommit_memory) doesn't increase the limit for me.

By the way, I'm using strtol because atol seemed to be narrowing the result to 32 bits even though a long is 64 bits. Instead of 2^31 I got -2^31, which unsurprisingly made mmap() fail. This seems like a bug, but probably isn't for some reason (possibly including a stupid mistake by me).

As you might have guessed, I'm using this huge region of memory as a big flat structure to record metadata about memory. The main trick of a linear page table is that we can use virtual memory to encode large sparse arrays, without allocating memory for page-sized regions of the table that are empty. This generalises to sparse tables other than page tables. The one I'm building is for tracking allocated heap regions.

[Update, 2011-3-23: thanks to Malcolm Scott who pointed out that my problem might be more tractable, because current x86-64 processors only implement a 48-bit address space. This also means that the 46-bit limit makes more sense---my mmap() successfully allocated a quarter of the usable virtual address space! Now I'm wondering: are those 48 bits something I can rely on for the nominal x86-64 architecture, or will running the same binaries on future hardware silently issue addresses from the larger 64-bit space? For now it doesn't really matter, but send answers if you have them (on a postcard, if you like) please.]

[/devel] permanent link contact

Fri, 19 Feb 2010

C++ Gotme number 4: constant surprise

When I was less experienced with C++, I would avoid creating any sort of constness in my code because it invariably led to headaches. When I'd tried, all too often I'd found myself trawling through the source code either adding or removing constness all over the place in order to make things compile. It's certainly true that constness has to be done right or not at all. Like a lot of C++, it's not resilient either to mistakes or to change.

These days I usually make a reasonable job of it. I've found the best way to understand constness is as follows. Every object in a C++ program exposes exactly two variants of its interface: the “full” one and the const one. The actual boundary between these, at least in the case of user-defined types, is somewhat flexible (as witnessed by mutable, and by your forgetfulness to add const to method definitions!).

Grokking the difference between a const pointer and pointer-to-const is of course a rite of passage in C and C++ programming. A more obscure subtlety about const in C++ is that const references let you pass around temporaries' lvalues, but you can't do this in a non-const fashion. It's not clear why not, except that it's something you'd rarely want to do---unless, of course, you had omitted to add the const annotation in the signature of whatever function you wanted to pass the lvalue to. That's one reason why getting const right is something you can't really opt out of.

Today I was wrestling with an unfortunate side of const---untraceable compilation errors. I'm using the Boost graph library so I can run graph algorithms over my DWARF data structure. Since all my “nodes” live in a std::map, I was specialising graph_traits, giving the map's value_type as the node type. Here's what the compiler said.

/local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.4.3/../../../../include/c++/4.4.3/bits/stl_pair.h: In member function ‘std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::die> >& std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::die> >::operator=(const std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::die> >&)’:
/local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.4.3/../../../../include/c++/4.4.3/bits/stl_pair.h:68:   instantiated from ‘boost::concepts::VertexListGraph<G>::~VertexListGraph() [with G = dwarf::encap::dieset]’
/home/srk31/opt/include/boost/graph/graph_concepts.hpp:166:   instantiated from ‘static void boost::concept::requirement<Model>::failed() [with Model = boost::concepts::VertexListGraphConcept<dwarf::encap::dieset>]’
/home/srk31/opt/include/boost/concept_check.hpp:43:   instantiated from ‘void boost::function_requires(Model*) [with Model = boost::concepts::VertexListGraphConcept<dwarf::encap::dieset>]’
test-6.cpp:8:   instantiated from here
/local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.4.3/../../../../include/c++/4.4.3/bits/stl_pair.h:68: error: non-static const member ‘const long long unsigned int std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::die> >::first’, can't use default assignment operator
In file included from test-6.cpp:1:
/home/srk31/opt/include/boost/graph/graph_concepts.hpp: In destructor ‘boost::concepts::VertexListGraph<G>::~VertexListGraph() [with G = dwarf::encap::dieset]’:
/home/srk31/opt/include/boost/graph/graph_concepts.hpp:166:   instantiated from ‘static void boost::concept::requirement<Model>::failed() [with Model = boost::concepts::VertexListGraphConcept<dwarf::encap::dieset>]’
/home/srk31/opt/include/boost/concept_check.hpp:43:   instantiated from ‘void boost::function_requires(Model*) [with Model = boost::concepts::VertexListGraphConcept<dwarf::encap::dieset>]’
test-6.cpp:8:   instantiated from here
/home/srk31/opt/include/boost/graph/graph_concepts.hpp:188: note: synthesized method ‘std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::die> >& std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::die> >::operator=(const std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::die> >&)’ first required here
make: *** [test-6.o] Error 1
    

    The bottom error refers to line 188 of graph_concepts.hpp, which is doing an innocuous assignment from a vertex iterator: v = *p.first;. The variable v is an instance of my map entry type. Somehow, the compiler can't synthesise an assignment operator for the pointed-to vertex. This was incredibly difficult to track down because it wasn't due to any use of const in my code directly. I scoured my code for suspicious const, but to no avail. What had escaped me is that the value_type in a std::map is as follows (from Stroustrup's C++PL, section 17.4.1.1).

    typedef pair<const Key, T> value_type;
    

    In other words, maps are designed so that you can't update the key of an entry once it has been created. This is very sane, because to do so might violate some invariant of the containing structure (imagining the popular red-black tree implementation). I hadn't noticed this limitation before because although it's common to initialise new pair objects, it's rare to update an existing one. Even though I had purposely avoided making the whole pair const in my code, there was already a const lurking in the header.

    That the compiler couldn't give me a better error message (i.e. one actually pointing to the code at fault) is certainly disappointing, and will be fixed in a better compiler. Whether all this fiddling constness is a weakness with the C++ language I'm not sure. It's certainly self-consistent and well-motivated. Other languages might not have me worrying about const, but they also might not catch certain errors. Is C++'s trade-off good value? It's also not clear whether a better definition of map might not have lifted the const into individual references to the value_type... or not. I'd like to be able to rail and propose some better language design, but part of me suspects that sometimes in programming, the devil really is inescapably in the details.

    [/devel] permanent link contact

    Mon, 01 Feb 2010

    Smart pointers: smart pointers (C++ Gotme number 3)

    I've been writing some C++ code using boost's smart pointers recently. Naturally, I tripped myself up. Memory corruption appeared, and I noticed that my shared objects were the problem. They were getting destructed even though I had a shared_ptr pointing to them the whole time. Here's how it happened.

    boost::shared_ptr<Die_encap_subprogram>
    create_subprogram(Die_encap_base& parent, boost::optional<const std::string&> name)
    {
        return boost::shared_ptr<Die_encap_subprogram>(new Die_encap_subprogram(parent, name));
    }
    

    The return statement heap-allocates a new Die_encap_subprogram, then constructs a shared_ptr owning it, which is returned by value.

    So far, so good. Here's the calling code.

    	std::cerr << "dies size before: " << ds().size();
    	abstract::factory::get_factory(get_spec()).create_subprogram(
    		(**all_cus.compile_units_begin()),
    		std::string(symname));
    	std::cerr << "dies size after: " << ds().size();
    

    Although my factory method “helpfully” returns a shared pointer, we discard the return value of the factory method. So, for the object to stay alive, we are relying on there being a copy of the boost::shared_ptr<Die_encap_base> somewhere. The code I've shown doesn't take a copy, so where could one come from? Well, one is created inside the object constructor, where it registers itself with a containing data structure. Here's the code which creates that second shared pointer.

    Die_encap_base(Dwarf_Half tag,
            Die_encap_base& parent, 
            boost::optional<const std::string&> name)
         :	die(parent.m_ds, parent.m_offset, tag, parent.m_ds.next_free_offset(), 0, attribute_map(), die_off_list()) // FIXME
        {
            m_ds.insert(std::make_pair(m_offset, boost::shared_ptr<dwarf::encap::die>(this)));
            parent.m_children.push_back(m_offset);
            m_attrs.insert(std::make_pair(DW_AT_name, dwarf::encap::attribute_value(
                std::string(*name))));
    	}
    

    If this second shared pointer is going to be sufficient to keep the object alive, we require some “action at a distance”: the first shared pointer must somehow become aware that we have just created another shared pointer to its object. How might this work? The shared_ptr implementation would have to keep its reference counts in a shared table, accessible to all shared pointers, so that they can look up the relevant refcount from the pointer. The table must be shared between all shared pointer instances, not just those of a given type parameter, to allow upcasting and downcasting---notice how the target type is different in this snippet than in the earlier one, because we're in the base class. For this reason, our hypothetical table implementation would have to catch pointers-to-subobjects, probably by storing address ranges, not just object start addresses. Even this is tricky if the shared pointer was created from an imprecisely-typed pointer (i.e. one whose static type is a supertype) because we might be oblivious to some larger enclosing allocation (i.e. the superobject actually allocated). This is doable, albeit not portably, given boost::shared_ptr's caveat that it only applies to pointers that were returned by new---so we could get the full address range using the heap metadata.

    Instead of all this table business, boost::shared_ptr just uses a shared count created when constructing a smart pointer from a non-smart pointer. Each pointer carries the address of a separate object holding the shared refcount. When you copy the shared pointer, it increments the count and propagates the shared count's address into the new copy of the pointer. Naturally, since it relies on continuity of propagation through shared pointers, it doesn't work when that chain is broken---you end up with multiple counts for the same object. My code suffers exactly that problem: it creates a completely separate shared pointer to what is “coincidentally” the same object referenced by some shared_ptr (i.e. communicated via one or more intervening non-shared pointers, in my case the this pointer in the constructor call).

    The fix is simple enough: have the factory return a regular pointer or reference to the constructed object. Of course, what if some client code innocently wants to use shared_ptr on that object? That's not necessarily a problem, but it won't magically extend the life of objects that would otherwise be reclaimed according to the object lifetime policies which my own code's use of shared_ptr creates. This ability to create, out of innocuous-looking code, multiple conflicting policies about when an object should be reclaimed, is a limitation of the smart pointer approach, and is stronger than simply “shared_ptr won't collect cycles”.

    Slogan-wise, this is perhaps an argument in favour of “smart objects, not smart pointers”. True per-object reference counts can be implemented in at least two ways: either a Python-like in-band solution, where each object has a refcount built-in, or else the unconventional out-of-band table solution I outlined above. The former isn't an option for C++, because we can't retroactively add a count field to an object, and don't want all instances of all classes to pay the overhead. However, the latter solution is quite appealing to me. Keeping object descriptions out-of-band is a powerful technique for retaining binary compatibility while adding run-time facilities, and is the basis of my proposed implementation of Python, which I might just outline in a future post.

    [/devel] permanent link contact

    Wed, 14 Oct 2009

    C++ Gotme number 2: templates, typename and specialization syntax

    Here are some innocuous-looking template definitions in C++.

    // specialize this template!
    template <unsigned spec_id> class Extended_By 
    {
    public:
        typedef struct {} spec;
    };
            
    // specialization for Dwarf 3
    template <> class Extended_By<0U>
    {
    public:
        typedef empty_def spec;
    };
    
    
    template <unsigned spec_id> 
    class table_def : public Extended_By<spec_id>::spec
    {
    public:
        typedef Extended_By<spec_id>::spec Extended;
    // ...
    };
    

    I am implementing a delegation hierarchy of tables: lookup methods on the tables delegate to their base implementation if the lookup fails. The hierarchy is rooted at a special-cased “empty” definition (not shown). Each non-root table has a similar implementation, but must be a separate class (as it may want to override certain member functions), hence my use of templates. To define a new table, we specialize the latter template with a new integer argument denoting the concrete table being defined. (In case you're wondering, tables are singletons---but they get pointed-to at runtime, hence virtual dispatch.) The Extended_By template is simply a compile-time mapping from integers (which denote tables) to the type of the underlying table (i.e. the table which that table delegates to, if its lookup fails).

    Unfortunately, the code above doesn't compile.

    g++ -Wall -O0 -g3 -I/home/srk31/opt/include -I/usr/include/python2.5 -I/usr/incl
    ude -I../../../libsrk31c++ -c -o "spec.o" "spec.cpp"
    In file included from spec.cpp:8:
    spec.hpp:242: error: type `dwarf::spec::Extended_By<spec_id>' is not derived fro
    m type `dwarf::spec::table_def<spec_id>'
    

    The error message makes no sense at all. In fact, the problem is that the compiler can't tell that Extended_By<spec_id> is a type. If you insert the typename keyword into the latter template definition, as follows....

    template <unsigned spec_id> 
    class table_def : public Extended_By<spec_id>::spec
    {
    public:
        typedef typename Extended_By<spec_id>::spec Extended;
    // ...
    };
    

    ...then it all magically starts working. I must confess I'm not entirely sure why---surely the usual name lookup on Extended_By would work just fine?

    On a similar theme, I was specializing some predicates defined by the table template, using code as follows.

    template<unsigned spec_id> 
    bool table_def<0U>::attr_describes_location(int attr) const
    {
    	return this->Extended::attr_describes_location(attr)
    		|| attr == DW_AT_location
    		|| attr == DW_AT_data_member_location
    		|| attr == DW_AT_vtable_elem_location
    		|| attr == DW_AT_string_length
    		|| attr == DW_AT_use_location
    		|| attr == DW_AT_return_addr;
    }
    

    This also fails to compile, with a very confusing message.

    g++ -Wall -O0 -g3 -I/home/srk31/opt/include -I/usr/include/python2.5 -I/usr
    /include -I../../../libsrk31c++ -c -o "spec.o" "spec.cpp"
    spec.cpp:272: error: prototype for `bool dwarf::spec::table_def<0u>::attr_d
    escribes_location(int) const' does not match any in class `dwarf::spec::tab
    le_def<0u>'
    spec.hpp:278: error: candidate is: bool dwarf::spec::table_def<spec_id>::at
    tr_describes_location(int) const [with unsigned int spec_id = 0u]
    make: *** [spec.o] Error 1
    
    The solution is to eliminate the template parameter.

    template<> 
    bool table_def<0U>::attr_describes_location(int attr) const
    {
    	return this->Extended::attr_describes_location(attr)
    		|| attr == DW_AT_location
    		|| attr == DW_AT_data_member_location
    		|| attr == DW_AT_vtable_elem_location
    		|| attr == DW_AT_string_length
    		|| attr == DW_AT_use_location
    		|| attr == DW_AT_return_addr;
    }
    

    It's not really clear why the latter is a valid syntax for specialization while the former isn't. However, a useful rule of thumb would seem to be that you should only list template arguments up-front (after “template”) if your definition depends on them. It doesn't matter if the originating class template uses more arguments. By contrast, the argument list attached to the classname is what nails down the particular template instance you're specializing for---this argument list should correspond directly to that of the template class definition.

    [/devel] permanent link contact

    Tue, 29 Sep 2009

    Standalone ksmserver, part 2: getting it right

    I blogged previously about my attempts to use KDE's session manager in a standalone fashion (with fvwm, in my case). What I presented there isn't entirely satisfactory, mainly because not all window sizes and positions get restored when restoring the session. I've finally worked out why. Unfortunately, this has also revealed that some content in the previous post is wrong. Allow me to correct myself here.

    So, all you need in your fvwm config is something like the following.

    AddToFunc SessionInitFunction
     + I Exec dcop klauncher klauncher autoStart 3; \
    dcop ksmserver ksmserver autoStart0Done; \
    dcop ksmserver ksmserver kcmPhase1Done; \
    dcop ksmserver ksmserver autoStart1Done; \
    dcop ksmserver ksmserver kcmPhase2Done; \
    sleep 4; dcop ksmserver ksmserver autoStart2Done
    
    # assuming you quit using a Quit-Verify menu...
    AddToMenu Quit-Verify   "Quit"  SaveQuitSession
    

    [/devel] permanent link contact

    Shell Gotme number 1 -- the Heisenbergian process trees

    On my Lab machine, the X login system doesn't clean up stray child processes that your X session may have left behind. (I've a feeling the Debian xinit scripts do this, but I'm not sure.) This was bothering me, because I start some background jobs in my .xsession which I want to die naturally when the session ends. So I put the following towards the end of my .xsession.

    echo -n "End of .xsession: detected living child pids: " 1>&2
    ps --no-header f -o pid --ppid $$ | \
    	while read pid; do kill "$pid"; done 2>/dev/null

    Unfortunately, I found that those pesky children were still alive. (Can you tell what's wrong? Yes, that's right.) Both the ps process and the while subshell are among the children which are being killed. So one way or another, the pipeline gets broken before the loop has managed to kill the children I actually wanted to kill. A version which doesn't suffer this problem is as follows.

    child_pids=$( ps --no-header f -o pid --ppid $$ )
    for pid in ${child_pids}; do kill "$pid" 2>/dev/null; done

    It's another instance of the familiar Heisenbergian nature of inspecting process trees from the shell: doing many basic things from the shell entails creating processes, so inspecting the shell's own process tree is likely to yield perturbed results. Usually it's just an unwanted entry in ps (as with the old ps | grep foo gotcha) but the above shows it sometimes causes subtler problems.

    [/devel] permanent link contact

    Fri, 18 Sep 2009

    Python, threading and the madness of language implementors

    I just stumbled across this video of a talk by David Beazley about the implementation of Python and, in particular, threading. In a nutshell, threading in Python is implemented as a complete minimal-effort hack on top of an interpreter that is essentially single-threaded by design. Static state is again the culprit---there's a big lock, called the Global Interpreter Lock, much like Linux used to have the Big Kernel Lock. But, rather than just protecting some carefully-selected critical regions, it protects all Python computation!

    So, alarmingly, the performance of Python threading is absolutely disastrous. It's ironic that Google are heavy users of Python, given that they work with some of the largest datasets on the planet and generally have to know a thing or two about optimisation and concurrency. Of course they don't use Python for their core systems, and are sponsoring an improvement effort called Unladen Swallow.

    I have some research ideas that are predicated heavily on the position that language implementors often don't do a great job, particularly in the case of dynamic languages. So if we have to rewrite a bunch of dynamic language implementations, that's not really a terrible thing. I'm glad to have yet more evidence of this.

    [/devel] permanent link contact

    Tue, 15 Sep 2009

    clone() and the unusual process dynamics

    I had a weird problem with my X login scripts recently on my Lab machine---I noticed that for X sessions, the LD_LIBRARY_PATH environment variable wasn't being set, even though every other variable set from my ~/.profile was still appearing correctly. After a bit of digging, I discovered why, but that only opened up an extra can of mystery.

    The basic problem was that ssh-agent is a setuid program, so at the start of loading, the Linux loader removes all dynamic linker options (including LD_PRELOAD and LD_LIBRARY_PATH) from the environment to avoid running user code with elevated privileges. (It would make more sense just to ignore them, but still to propagate them to children... but anyway.) I was still a bit puzzled though, because I wasn't knowingly running my session as a child of ssh-agent---my login scripts do start one up if it's not already running, but it's supposed to run as a sibling of my main X session script, rather than using the ssh-agent <command> mechanism to start a child session. And ps agreed with me---my session wasn't descended from an ssh-agent process... but I did have two running, where I thought my scripts went to pains to ensure there was only one.

      PID TTY      STAT   TIME COMMAND
    19233 ?        Ss     0:00 /bin/bash -l /home/srk31/.xsession
    19573 ?        S      0:00  \_ ssh-agent
    19589 ?        S      0:00  \_ fvwm
    19593 ?        Ss     0:00      \_ <... more X clients>
    19433 ?        Ssl    0:00 /bin/dbus-daemon --fork --print-pid 4 --print-address
    19432 ?        S      0:00 /usr/bin/dbus-launch --exit-with-session /home/srk31/

    The explanations for this were somewhat convoluted. Firstly, the Fedora xinit scripts (FC7) do this (in /etc/X11/xinit/xinitrc-common, sourced from /etc/X11/xinit/Xsession):

    # Prefix launch of session with ssh-agent if available and not already running.
    SSH_AGENT=
    if [ -x /usr/bin/ssh-agent -a -z "$SSH_AGENT_PID" ]; then
        if [ "x$TMPDIR" != "x" ]; then
            SSH_AGENT="/usr/bin/ssh-agent /bin/env TMPDIR=$TMPDIR"
        else
            SSH_AGENT="/usr/bin/ssh-agent"
        fi
    fi

    ...and later (in /etc/X11/xinit/Xsession)...

    # otherwise, take default action
    if [ -x "/.xsession" ]; then
        exec -l  -c "  /.xsession"
    elif [ -x "/.Xclients" ]; then
        exec -l  -c "  /.Xclients"
    elif [ -x /etc/X11/xinit/Xclients ]; then
        exec -l  -c "  /etc/X11/xinit/Xclients"
    else
        # should never get here; failsafe fallback
        exec -l  -c "xsm"
    fi

    In other words, they test whether an ssh-agent is running, and arrange to start one if not. But in between testing and starting one, they run a shell---which naturally starts my login scripts. These check for ssh-agent themselves and, finding none, start one. Then later, the Fedora scripts start another one. It's a classic “unrepeatable read” race condition, but without any concurrency---just interleaving of foreign code (my login scripts).

    Next, why wasn't my session showing up as a child of one of the ssh-agent processes? ps's output was doubly confusing because the top of my process tree was a bash -l .xsession process, when that's the last to be launched by the sequence initiated in the Fedora scripts! Well, strace revealed that my processes were using the clone() system call to spawn new processes (which is subtly different from fork(), in that it allows shared address spaces and hence multithreaded programming). As we know, when a process clones itself in order to start a new process, one of the resulting pair replaces itself with the new process image, while the other continues on its merry way. In the case of both ssh-agent and dbus-launch, the parent process was the one which replaced itself, leaving the child to continue the work of SSH agentery or DBUS launchery or whatever. This is really confusing because it contradicts the usual expectations about causal ordering from parent to child processes---but it's perfectly allowed, and has always been possible in Unix.

    What was the fix? Sadly, there isn't a good one---I don't have the permission to edit the Fedora scripts on my Lab machine, and there's no configurable flexibility for disabling the ssh-agent launching or fixing the racey logic. So I added a hack to my .xsession shell script which detects the case where SSH_AGENT_PID is already set to a child of the shell process (since the ssh-agent my scripts created is a sibling) and if so, kills that process and re-sets SSH_AGENT_PID to the one I created earlier (which, handily, I keep stored in ${HOME}/.ssh/agent-$(uname -n)). As usual, completely horrible.

    [/devel] permanent link contact

    Sat, 22 Aug 2009

    C++ Gotme number 1: operator visibility

    I'm doing some C++ coding at the moment. It's a language well-known for “gotchas”, so although it'd be presumptuous for me to assume they'd get you as well, dear reader, I'm going to document some of the things that've caught me out.

    namespace dwarf {
    	namespace encap {
    		typedef Dwarf_Loc expr_instr;
    		bool operator==(const expr_instr& e1, const expr_instr& e2);
    		bool operator!=(const expr_instr& e1, const expr_instr& e2);
    		typedef struct expr
    		{
    			Dwarf_Half hipc;
    			Dwarf_Half lopc;
    			std::vector<expr_instr> m_expr;
    			expr(const Dwarf_Locdesc& desc) : hipc(desc.ld_hipc), lopc(desc.ld_lopc),
    				m_expr(desc.ld_s, desc.ld_s + desc.ld_cents) {}
    			bool operator==(const expr& e) const 
    			{ 
    				//expr_instr e1; expr_instr e2; // test
    				return hipc == e.hipc &&
    					lopc == e.lopc &&
    					//e1 == e2; // test
    					m_expr == e.m_expr; // error!
    			}
    			bool operator!=(const expr& e) const { return !(*this == e); }
    		} loc_expr;
    	}
    }
    

    Here we have some code from my C++ binding of libdwarf. I'm trying to use std::vector's builtin == operator to define equality on my struct expr, which is essentially a wrapper for vectors of Dwarf_Loc objects, where Dwarf_Loc is a structure defined by libdwarf. Here's what g++ makes of the above code:

    /local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.3.3/../../../../incl
    ude/c++/4.3.3/bits/stl_algobase.h: In static member function ‘static bool std::_
    _equal<_BoolType>::equal(_II1, _II1, _II2) [with _II1 = const Dwarf_Loc*, _II2 =
     const Dwarf_Loc*, bool _BoolType = false]’:
    /local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.3.3/../../../../incl
    ude/c++/4.3.3/bits/stl_algobase.h:824:   instantiated from ‘bool std::__equal_au
    x(_II1, _II1, _II2) [with _II1 = const Dwarf_Loc*, _II2 = const Dwarf_Loc*]’
    /local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.3.3/../../../../incl
    ude/c++/4.3.3/bits/stl_algobase.h:956:   instantiated from ‘bool std::equal(_II1
    , _II1, _II2) [with _II1 = __gnu_cxx::__normal_iterator<const Dwarf_Loc*, std::v
    ector<Dwarf_Loc, std::allocator<Dwarf_Loc> > >, _II2 = __gnu_cxx::__normal_itera
    tor<const Dwarf_Loc*, std::vector<Dwarf_Loc, std::allocator<Dwarf_Loc> > >]’
    /local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.3.3/../../../../incl
    ude/c++/4.3.3/bits/stl_vector.h:1111:   instantiated from ‘bool std::operator==(
    const std::vector<_Tp, _Alloc>&, const std::vector<_Tp, _Alloc>&) [with _Tp = Dw
    arf_Loc, _Alloc = std::allocator<Dwarf_Loc>]’
    dwarfpp.hpp:341:   instantiated from here
    /local/scratch/srk31/opt/bin/../lib/gcc/i686-pc-linux-gnu/4.3.3/../../../../incl
    ude/c++/4.3.3/bits/stl_algobase.h:795: error: no match for ‘operator==’ in ‘* __
    first1 == * __first2’
    make: *** [dwarfpp.o] Error 1
    

    Clearly, my definition for operator== isn't being found. But if I uncomment the blamed line and replace it with some dummy code (commented above) which invokes the operator for two dummy objects, rather than vectors, it works fine. Why can I compare two Dwarf_Locs but not two vectors thereof, when vector defines a perfectly good operator== that should just invoke mine?

    The answer is that vector's operator== can't see my operator== definition, because of namespaces. According to Stroustrup (C++PL, third edition, section 11.2.4):

    Consider a binary operator @. If x is of type X and y is of type Y, x@y is resolved like this:

    - if X is a class, look for operator@ as a member of X or as a member of a base of X; and
    - look for declarations of operator@ in the context surrounding x@y; and
    - if X is declared in namespace N, look for declarations of operator@ in N; and
    - if Y is declared in namespace M, look for declarations of operator@ in M.

    So the code in std::vector clearly can't see my definitions in dwarf::encap::. However, the reason that this isn't such a common problem is that I'm defining operators on a type, namely Dwarf_Loc, that was defined not by me but in the pre-existing C library that I'm wrapping. I lazily dumped all of libdwarf's definitions into the global namespace, so the quick fix is to define my operator in the global namespace too.

    namespace dwarf {
    	namespace encap { typedef Dwarf_Loc expr_instr; }
    }
    bool operator==(const dwarf::encap::expr_instr& e1, const dwarf::encap::expr_instr& e2);
    bool operator!=(const dwarf::encap::expr_instr& e1, const dwarf::encap::expr_instr& e2);
    namespace dwarf {
    	namespace encap {
    		//typedef Dwarf_Loc expr_instr;
    		typedef struct expr
    		{
    			Dwarf_Half hipc;
    			Dwarf_Half lopc;
    			std::vector<expr_instr> m_expr;
    			expr(const Dwarf_Locdesc& desc) : hipc(desc.ld_hipc), lopc(desc.ld_lopc),
    				m_expr(desc.ld_s, desc.ld_s + desc.ld_cents) {}
    			bool operator==(const expr& e) const 
    			{ 
    				//expr_instr e1; expr_instr e2;
    				return hipc == e.hipc &&
    					lopc == e.lopc &&
    					//e1 == e2;
    					m_expr == e.m_expr; // okay
    			}
    			bool operator!=(const expr& e) const { return !(*this == e); }
    		} loc_expr;
    	}
    }
    

    Note that if I'd done the right thing and wrapped libdwarf's definitions into some non-global namespace, say dwarf::lib, I'd still need to do something similar, because my operator definition won't be found if I put it in a different namespace like dwarf::encap, even though that's the namespace containing the code which actually needs the operator to be defined.

    Well, it certainly got me... now, on with coding.

    [/devel] permanent link contact

    Mon, 17 Aug 2009

    Ruby fails

    I'm sure Ruby is a nice language, but I get slightly annoyed by two things whenever I try to install a Ruby program. One is that it has its own build system, as a replacement for Make et al---there are things called Rakefiles in the source distributions. The other is that it also has its own configuration and deployment system, based on the setup.rb script. These things annoy me because they're yet more tools that I need to learn, and for no good reason. Python is guilty of the same sins too. A new language shouldn't entail a whole new build system. (I won't go into why, but hopefully you don't need me to. I would also complain that there are too many languages, but I'll save that too.)

    What really annoys me about Ruby is that setup.rb is broken, because it doesn't deal with prefixes properly. If I do ruby setup.rb config --prefix=${HOME}/opt, it still tries to install most of the program under /usr. So I tried giving the --prefix option to ruby setup.rb install too, but that doesn't do the right thing either. Instead it creates me a ${HOME}/opt/usr hierarchy and puts the stuff it was putting in /usr there.

    I might as well come clean and admit that the only Ruby program I routinely install is iplayer-dl. Anyway, my next attempt: configure everything using paths relative to ./, then supply the overall prefix at install time. That doesn't work either---setup.rb interprets the ./ passed with --prefix, so you get install destinations relative to your configure directory. But only for files affected by --prefix, which isn't all of them.

    Next attempt: configure everything relative to the root directory, then supply the prefix at install time. This does work, but you wouldn't guess from the output of ruby setup.rb install.

    $ rubyver=$( ruby --version | sed 's/ruby \([0-9]*\.[0-9]*\)\..*/\1/' )  # horrible
    $ rubyarch=$( uname -i )-$( uname -s | tr '[:upper:]' '[:lower:]' )           # horrible
    $ ruby ./setup.rb config --prefix='/' --sysconfdir=/etc \
    --libruby=/lib/ruby --librubyver=/lib/ruby/${rubyver} --librubyverarch=/lib/ruby/${rubyver}/${rubyarch} \
    --siteruby=/lib/ruby/site_ruby --siterubyver=/lib/ruby/site_ruby/${rubyver} --siterubyverarch=/lib/ruby/site_ruby/${rubyver}/${rubyarch}
    $ ruby ./setup.rb install --prefix=${HOME}/opt
    rm -f InstalledFiles
    ---> bin
    mkdir -p /home/srk31/opt//bin
    install iplayer-dl //bin//iplayer-dl
    install iplayer-dl-gui //bin//iplayer-dl-gui
    <--- bin
    ---> lib
    mkdir -p /home/srk31/opt/lib/ruby/site_ruby/1.8
    install iplayer.rb /lib/ruby/site_ruby/1.8/
    ---> lib/iplayer
    mkdir -p /home/srk31/opt/lib/ruby/site_ruby/1.8/iplayer
    install downloader.rb /lib/ruby/site_ruby/1.8/iplayer
    install subtitles.rb /lib/ruby/site_ruby/1.8/iplayer
    install metadata.rb /lib/ruby/site_ruby/1.8/iplayer
    install preferences.rb /lib/ruby/site_ruby/1.8/iplayer
    install browser.rb /lib/ruby/site_ruby/1.8/iplayer
    install version.rb /lib/ruby/site_ruby/1.8/iplayer
    install errors.rb /lib/ruby/site_ruby/1.8/iplayer
    ---> lib/iplayer/gui
    mkdir -p /home/srk31/opt/lib/ruby/site_ruby/1.8/iplayer/gui
    install main_frame.rb /lib/ruby/site_ruby/1.8/iplayer/gui
    install app.rb /lib/ruby/site_ruby/1.8/iplayer/gui
    <--- lib/iplayer/gui
    <--- lib/iplayer
    <--- lib
    
    So, handily it's told me that it installed a bunch of stuff in /lib/ruby, which is exactly what I didn't want it to do. But, since I ran it without elevated permissions, I know that it can't have succeeded in doing that---yet there were suspiciously no error messages. Lo! Despite what it printed out, it has actually put the lib files in ${HOME}/opt/lib/ruby just as I wanted. Now, why was that so difficult?

    To make matters worse, you of course have to set your Ruby path to get the deployed program to work, and that is also horrifying:

    $ export RUBYLIB=${HOME}/opt/lib/ruby:${HOME}/opt/lib/ruby/site_ruby:${HOME}/opt/lib/ruby/site_ruby/1.8:
    
    ---disgusting, especially embedding the 1.8 version number in a path which will be seen (and interpreted) by any version of Ruby at all. It's following the pattern established by Python, of course, and since Python has some reasonably sane people behind it I'm tempted to suspect that this ludicrous scheme has been selected for a reason---but even if this suspicion is correct, it'll have to be a very good one, and I somehow doubt that.

    [/devel] permanent link contact

    Fri, 13 Mar 2009

    ANTLR recipes

    I've been fiddling with ANTLR recently, trying to create a parser for my Cake language. Unfortunately I found it trickier than I was hoping, and this forced me to take a step back and create some simpler grammars so I could get the hang of how to realise some common grammatical features in ANTLR. Specifically, I looked at four features: recursive prefixing (i.e. a recursive unary prefix operator, so right-recursive), recursive suffixing (the same but suffixing, left-recursive), right-associative binary operators and left-associative binary operators.

    The appeal of ANTLR has been its ability to build ASTs in special-purpose syntax (agnostic towards the host language) rather than relying on semantic actions. Unfortunately, the tree-construction features took me a while to get the hang of. The basics are here, but here are some extra snippets, and points of emphasis, that I found helpful to discover.

    I think that's all for now. I still have to get a good handle on operator precedence, which may or may not spawn another post.

    [/devel] permanent link contact

    Fri, 27 Feb 2009

    Standalone ksmserver, part 1

    I've been looking for an X11 session manager that actually works recently (since sadly xsm doesn't fit that bill) and was experimenting with KDE's session manager. It's peculiarly underdocumented but seemed like it should have all the functionality I needed, and should also be reasonably well-used and well-tested. So I was a bit disappointed when my basic set-up attempt at integrating it with fvwm appeared not to work. I was simply replacing the fvwm command in my .xsession with the following:

    ksmserver -r -w fvwm

    which should launch fvwm rather than kwin. Then in my fvwm configuration, to end the session properly when I quit the window manager, I tried the following.

    AddToFunc SessionInitFunction
     + I Exec echo hello from SessionInitFunction
    
    AddToFunc SessionExitFunction # kill ksmserver when we exit
     + I Exec dcop ksmserver ksmserver logout 0 2 2  
    
    ## from http://andrejserafim.wordpress.com/2008/05/16/kde-shutdown-logout-restart/
    #First parameter: confirm
    #Obey the user's confirmation setting: -1
    #Don't confirm, shutdown without asking: 0
    #Always confirm, ask even if the user turned it off: 1
    #Second parameter: type
    #Select previous action or the default if it's the first time: -1
    #Only log out: 0
    #Log out and reboot the machine: 1
    #Log out and halt the machine: 2
    #Third parameter: mode
    

    (Thanks to Andrej Kazakov for the summary of logout's invocation that I pasted above, in turn extracted from the KDE source code.)

    Of course, it didn't work, so I put my developer hat on and had a look. Attaching gdb revealed that the session manager was entering a (non-tight) endless loop when I requested the logout---setting a timer which fired after one second, disabled itself and then recursively re-started the process. The problem was that the session manager has an internal state machine, and in my case it was stuck in state AutoStart0, whereas it was expected to end up in Idle after a while---the only state from which it can initiate a shutdown.

    To get out of AutoStart0, and subsequent start-up states, you have to manually call a bunch of other DCOP methods with names like autoStart0Done, kcmPhase1Done and so on. This is part of what startkde does for KDE users, after performing various KDE-specific configuration updates. (These updates are exactly what a non-KDE user doesn't want to happen, at least in my case---since one major reason I don't use KDE is that I like to stick with the simple X11-provided ways of controlling backgrounds, cursors, input devices and so on.) We can manually invoke the DCOP methods to signal that it's time for the next state.

    We successfully avoid most of the KDE interference in this way, because the relevant other processes (like kcminit) haven't been launched. However, we do get some, because the klauncher process does exist -- it's started when running any KDE app if not already running. In particular, ksmserver's signals to klauncher have the unwanted consequence of starting up a bunch of “autostart” applications, like the KDE desktop background and panel, that have shortcuts in ~/.kde/Autostart and /share/autostart/. To avoid this, we tell the launcher to bump our start-up “phase”, which is just an integer, to a high value---since autostart apps are marked with a “phase” attribute, and are only started when moving through that phase. ksmserver only covers as far as phase 2, so we can set the phase to 3 and they won't be started up. So, here's my final fvwm config for use with ksmserver.

    [Update, 2009-03-02: not so fast! What I had here earlier didn't quite work, because fvwm doesn't necessarily execute each part of a compound function in sequence. So instead, it needs to all be rolled into one command. What's more, I had to hack in an arbitrary sleep because, depending on how long it takes to start up all the saved clients, the kcmPhase2Done call may come too soon (state machine not yet progressed into FinishingStartup, I think) and be ignored (causing ksmserver to get stuck in the latter state). Now, what follows does seem to work.]

    AddToFunc SessionInitFunction
     + I Exec dcop klauncher klauncher autoStart 3; \
    dcop ksmserver ksmserver autoStart0Done; \
    dcop ksmserver ksmserver kcmPhase1Done; \
    dcop ksmserver ksmserver autoStart1Done; \
    dcop ksmserver ksmserver kcmPhase2Done; \
    sleep 4; dcop ksmserver ksmserver autoStart2Done
    
    # signal ksmserver when we exit
    AddToFunc SessionExitFunction 
     + I Exec dcop ksmserver ksmserver saveCurrentSessionAs "saved at previous logout"; \
     dcop ksmserver ksmserver logout 0 0 3 
    

    I still haven't actually made the session management functionality work, which I think is going to take some more commands along “save” and “restore” lines. [Update, 2009-03-02: turns out to need just the one “save” command at exit, as shown above -- restoring happens by default, using the magic session name shown.] My ultimate goal is the ability to start multiple ksmserver processes, so that I can independently save and restore separate groups of clients within a single X login session. Fingers crossed... if it's nontrivial I'll write a follow-up post. There's also the joy of KDE 4 to think about later, which has exchanged DCOP for D-BUS and may well alter the undocumented behaviour I'm relying on.

    The Important Science to take away from all this is I suppose that interface protocol specifications and timing specifications are useful! Not just for checking, which seems a long way off here, but just for understanding. It'd also be useful to have more convenient support for interposing on DCOP communication, by having client-specific message routing (or name-bindings) so that the “phase” hack could be handled a bit more elegantly by cutting off the connection to klauncher.

    [/devel] permanent link contact

    Thu, 19 Feb 2009

    Initialization order

    I just spent an age tracking down a bug in a very simple tool I'm writing using the GNU BFD. It turns out that when opening a file for reading with BFD, various API calls are not valid until bfd_check_format_matches has been called and has returned successfully. This doesn't appear to be documented in the current texinfo manual. In fact, even the need to call bfd_init is not very clearly documented---but at least that one is fairly predictable. Most libraries, particularly those which maintain a lot of state behind the scenes, have an init function. Additional initialization steps, like the one I ran afoul of, are not at all guessable and should really be clearly signposted in the documentation. It was only by careful examination of what objcopy does that I could track down the bug.
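    In C, the call sequence I eventually arrived at looks roughly like the following sketch (error handling abbreviated; this reflects my reading of the bfd.h I have, so treat it as illustrative rather than definitive):

    #include <stdio.h>
    #include <bfd.h>

    int main(int argc, char **argv)
    {
        char **matching;
        bfd_init();                        /* the predictable part */
        bfd *abfd = bfd_openr(argv[1], NULL);
        if (!abfd) { bfd_perror("bfd_openr"); return 1; }
        /* the undocumented part: this call must *succeed* before
         * most other queries on abfd are valid */
        if (!bfd_check_format_matches(abfd, bfd_object, &matching))
        { bfd_perror("bfd_check_format_matches"); return 1; }
        /* only now is it safe to ask questions like this one */
        printf("start address: 0x%lx\n",
            (unsigned long) bfd_get_start_address(abfd));
        bfd_close(abfd);
        return 0;
    }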

    It's well-known in the research world that interfaces come with protocol constraints, and it'd be nice to check client code against these just as we already do with function signatures. Initialization constraints are by far the most common example of protocol constraint. But rather than just checking satisfaction of these constraints, why should we even have to write the code at all? Why can't our toolchain insert initialization calls in the right places automatically? In fact at least one research linkage tool (Knit) can already do this.

    How far can we generalise this constraint-solving? Usually protocol specifications use a finite-state model of component state, which is clearly good for a wide range of protocols but not all (consider a stack). Of course, we want our checking/solving to be decidable, and more complex models quickly lose this property, although I'm not sure whether a pushdown automaton model would necessarily be undecidable. More practically, Knit essentially supports a single bit of state (initialized or not) for each component (or actually a tri-state, because it supports finalizers too). If we wanted to generalise this even to a full finite-state model, we'd get ambiguity: say we had to call function A one or more times before calling function B; how many calls would we insert? Maybe “the smallest string” would be a sensible policy, but that would be ambiguous in some cases, and it's not clear it's always a good choice anyway. In protocol adaptation, there is invariably some programmer-provided “specification” to answer these questions, usually by constraining the generated automaton.

    The BFD example is unusual in another way, in that it requires a successful call to the init function, not just any call. So our protocol checker/solver has to understand return values (and of course arguments) as well as function names. It's about time I re-examined the research literature on protocol adaptation and worked out how much of it has actually been turned into practical tools that address these issues. In due course I'm hoping to build a practical subset (or superset) of protocol adaptation functionality into Cake. One thing at a time though... first I must master that pesky BFD.

    [/devel] permanent link contact

    Fri, 23 Jan 2009

    Library path strangenesses

    It's about time I began sharing the hard-learnt arcane development knowledge I've managed to pick up. One of the most annoying “features” I've had to reverse-engineer is in the behaviour of Linux's dynamic linker, ld-linux. I have a lot of directories in my LD_LIBRARY_PATH environment variable.

    srk31@font:~/scratch/kde$ echo $LD_LIBRARY_PATH
    /home/srk31/scratch/opt/lib:/home/srk31/scratch/kde/lib:/usr/lib/mysql:/usr/lib/opensync/:/usr/lib/o
    pensync/formats/:/usr/lib/opensync/plugins/:/usr/lib/qt-3.3/lib:/usr/lib/wine/:/home/srk31/opt/lib:/
    lib:/usr/lib 
    

    They all get searched as you'd expect. Each directory is expanded to include searches for various architecture-specific variant subdirectories like tls, sse2, nosegneg etc.

    srk31@font:~/scratch/kde$ LD_DEBUG=libs ldd bin/konsole 2>&1 | head
          5666:     find library=libtinfo.so.5 [0]; searching
          5666:      search path=/home/srk31/scratch/opt/lib/tls/i686/sse2/nosegneg:/home/srk31/scratch/
    opt/lib/tls/i686/sse2:/home/srk31/scratch/opt/lib/tls/i686/nosegneg:/home/srk31/scratch/opt/lib/tls/
    i686:/home/srk31/scratch/opt/lib/tls/sse2/nosegneg:/home/srk31/scratch/opt/lib/tls/sse2:/home/srk31/
    scratch/opt/lib/tls/nosegneg:/home/srk31/scratch/opt/lib/tls:/home/srk31/scratch/opt/lib/i686/sse2/n
    

    (snipped some -- view source for more)

    osegneg:/home/srk31/opt/lib/i686/sse2:/home/srk31/opt/lib/i686/nosegneg:/home/srk31/opt/lib/i686:/ho
    me/srk31/opt/lib/sse2/nosegneg:/home/srk31/opt/lib/sse2:/home/srk31/opt/lib/nosegneg:/home/srk31/opt
    /lib            (LD_LIBRARY_PATH)
          5666:       trying file=/home/srk31/scratch/opt/lib/tls/i686/sse2/nosegneg/libtinfo.so.5
          5666:       trying file=/home/srk31/scratch/opt/lib/tls/i686/sse2/libtinfo.so.5
          5666:       trying file=/home/srk31/scratch/opt/lib/tls/i686/nosegneg/libtinfo.so.5
    

    Notice that in my LD_LIBRARY_PATH, /usr/lib and /lib are right at the end. Once I tried tweaking the ordering so that some paths came after these -- they were “backup” libraries I'd compiled myself in case the machine I was using didn't have them, but I wanted to pick up any local ones in preference. Unfortunately, this doesn't work. ld-linux will ignore all directories after the first directory it encounters that is part of the “system library path”.

    srk31@font:~/scratch/kde$ LD_LIBRARY_PATH=/usr/lib:${LD_LIBRARY_PATH} LD_DEBUG=libs ldd bin/konsole 2>&1 | head
          5687:     find library=libtinfo.so.5 [0]; searching
          5687:      search path=/usr/lib/tls/i686/sse2/nosegneg:/usr/lib/tls/i686/sse2:/usr/lib/tls/i68
    6/nosegneg:/usr/lib/tls/i686:/usr/lib/tls/sse2/nosegneg:/usr/lib/tls/sse2:/usr/lib/tls/nosegneg:/usr
    /lib/tls:/usr/lib/i686/sse2/nosegneg:/usr/lib/i686/sse2:/usr/lib/i686/nosegneg:/usr/lib/i686:/usr/li
    b/sse2/nosegneg:/usr/lib/sse2:/usr/lib/nosegneg:/usr/lib                (system search path)
          5687:       trying file=/usr/lib/tls/i686/sse2/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/i686/sse2/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/i686/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/i686/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/sse2/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/sse2/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/libtinfo.so.5
    

    I have no idea why it does this, but would guess it's intended behaviour for some reason. It's annoying though.

    Oh, and another gotcha relating to LD_LIBRARY_PATH is that if you're using gdb, it seems you must set solib-search-path to match your LD_LIBRARY_PATH, because gdb seems to ignore the latter (again, probably for some reason or other). And if you ever use a frontend to gdb, like DDD, that probably has its own setting too. There is so much fun to be had with these things.

    [/devel] permanent link contact


    Powered by blosxom
