Rambles around computer science

Diverting trains of thought, wasting precious time

Thu, 19 May 2011


At the MMNet workshop in Glasgow last week, I talked about memtables. These are an efficient associative data structure, built using virtual memory support on modern OSes (currently implemented for Linux only), that are useful whenever you want to key a table on addresses in memory. See my slides for more.

Since entries with numerically similar keys are stored close to each other, memtables are, like certain other associative data structures, amenable to searching within a key range as well as exact-match lookups. By contrast, hash tables can't do this. (That said, a hash table supporting duplicate keys can be used to store items grouped into small equivalence classes. This is sometimes good enough, and could be made to work in my case. Nonuniform key duplication will mess up the O(1) nature of hash tables though.)

Memtables seem like they could be useful in lots of places. I invented them for DwarfPython as a fast way of storing and retrieving metadata given a key that may be an interior pointer (hence the searching requirement). I'm also (soon) using them in Cake as a fast way of tracking what objects are associated with what other objects.

The key space doesn't have to be addresses. It's possible we could even use memtables for free chunk binning, since large sizes are sparsely used. I need to do some more experiments to establish this.

The implementation comes in two parts:

  • A generic set of malloc hooks for glibc: these hooks aree “generic” in that they're designed to be easily specialised for various conceivable kinds of instrumentation. They're not generic with respect to the allocator---sadly they're specific to glibc, but most mature allocators should have some similar mechanism. The key usefulness in these hooks is factoring the various cases of the malloc API ---specifically the complex behaviour of realloc, but also other annoyances including memalign and null frees--- into an easy-to-use set of higher-level hooks. These are likely (but not guaranteed) to be a better match for whatever your instrumentation is doing. For example, defining a post_successful_alloc() function will hook all events that allocate a new heap block, whether they originated in a malloc(), a realloc() or a memalign().
  • a generic memtable library: this will appear soon! It's a set of hook definitions that maintain a memtable, and a lookup function.
  • Memtables are strictly faster than a hash table, at least for lookups, because they are basically a hash table without the hashing. At least for most applications of memtables, the table itself acts as an index for a bunch of linked lists---call them bins or buckets. Rather than mapping the key space onto a smaller hash space in order to keep the index small, we index directly by the key, and rely on the virtual memory trick to keep the index small. Since we can only save page-sized chunks of space, the key-space really needs to be used in a clustered and/or very sparse fashion. Just one used key per page of index is enough to allocate the whole table in physical memory, which we don't want. So if your table has four-byte entries, say, uniform key usage should be a lot less than one per thousand possible keys---but clusters of many are okay, so long as they're thousands apart.

    [/devel] permanent link

    Powered by blosxom

    validate this page