inplace-stm v0.01
-----------------

This is an initial self-contained version of various 'in the clear'
algorithms for software transactional memory.  By 'in the clear' I
mean that the data manipulated by the STM is held in its ordinary
format, without any reserved space in the words stored, and that this
is entirely separate from the co-ordination data used by the STM
implementation.  This may make it easier to drop in the STM in place
of existing mechanisms for concurrency control because, for instance,
it is not necessary to introduce additional levels of indirection
through block-pointers.

However, note that it is far from clear how the performance of this
STM will compare with block-based ones or object-based ones.  The
expectation is that, as with conservative garbage collection, it
provides an easy path for incremental deployment and experimentation.
Currently there is no special support for read-only accesses during
transactions -- this will have a substantial impact on caching
performance.

There are three sub-directories:

 * "libstm" contains the implementation,

 * "test" contains two test files: "stm_test.c", which tests the basic
   update features of the STM, and "wait_test.c", which tests the
   blocking "STMWait" operation,

 * "include" contains "stm.h", which defines the interface exported by
   the STM and which is included both by its implementation and by the
   test code.

The main STM configuration parameters are provided as C pre-processor
macros that are recognised by stm.c.  In outline:

 * Choose one of DO_STM_ABORT_SELF, DO_STM_DEFER_OTHER and
   DO_STM_ABORT_OTHER.  These control how one transaction behaves when
   attempting to commit.  DO_STM_ABORT_OTHER gives obstruction-free
   non-blocking behaviour.

 * Choose one of DO_SECONDARY_SIMPLE, DO_SECONDARY_RC,
   DO_SECONDARY_PD, DO_SECONDARY_SMR or DO_SECONDARY_PTB to select
   which mechanism is used to manage temporary storage space needed by
   the algorithm.

 * Set STM_OWNERSHIP_RECORDS to control how many separate 'ownership
   domains' exist in the STM.  A hash function (OREC_HASH) maps
   addresses to ownership domains.  Transactions can be committed
   concurrently if they do not have any ownership domains in common.

 * Set MAX_THREADS to be sufficiently large (maximum number of active
   threads at once -- i.e. threads that have called STMAddThread but
   not STMRemoveThread).

 * Set MAX_LOCATIONS to be sufficiently large (maximum number of
   locations that a transaction may access).

 * To select between processor architectures define one of ENV_IA32,
   ENV_IA64, ENV_SPARC or ENV_SPARC_V9 in the GNUmakefiles.  The SPARC
   (v8plus) option is by far the best tested, followed by IA32.  Note
   that some restrictions apply on SPARC v9 and IA64, because these
   architectures do not support a double-word-width
   compare-and-swap:

    + the PTB memory allocation mechanisms cannot be used,

    + thread-local caches of allocated temporary structures cannot be
      balanced between threads (could be fixed by more careful
      implementation),

    + DO_STM_ABORT_OTHER is not available.
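
As an illustration of the ownership-domain idea above, here is a
minimal sketch of how addresses might be mapped to ownership records
and how two transactions' address sets could be checked for overlap.
It is not taken from stm.c: the value of STM_OWNERSHIP_RECORDS, the
function names orec_index and domains_disjoint, and the hash itself
(shift off the alignment bits, then reduce modulo the table size) are
all assumptions standing in for the real OREC_HASH.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical value; the real setting is chosen at build time. */
#define STM_OWNERSHIP_RECORDS 1024u

/* Hypothetical stand-in for OREC_HASH: discard the two low bits
   (constant for 4-byte-aligned words), then reduce modulo the
   number of ownership records. */
static unsigned orec_index(const void *addr)
{
    uintptr_t a = (uintptr_t) addr;
    return (unsigned) ((a >> 2) % STM_OWNERSHIP_RECORDS);
}

/* Return 1 if no address in xs shares an ownership domain with any
   address in ys, and 0 otherwise.  Uses a simple quadratic loop,
   which is only reasonable for small address sets. */
static int domains_disjoint(void *const *xs, int nx,
                            void *const *ys, int ny)
{
    assert(nx >= 0 && ny >= 0);
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            if (orec_index(xs[i]) == orec_index(ys[j])) {
                return 0;
            }
        }
    }
    return 1;
}
```

With this particular hash, two 32-bit words that lie exactly
4 * STM_OWNERSHIP_RECORDS bytes apart fall into the same domain, so
transactions touching unrelated data can still conflict.  Raising
STM_OWNERSHIP_RECORDS makes such false sharing less likely at the cost
of a larger table.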

Other tuning parameters:

 * MALLOC_BLOCK_SIZE is the unit in which the STM requests more memory
   for temporary storage. 

 * OBJECT_CACHE_HWM is the maximum volume of re-usable storage that
   one thread may hold.  If it is exceeded then the thread passes some
   back (down to OBJECT_CACHE_LWM) to a shared pool.  Synthetic
   workloads otherwise showed net flows of storage from allocating
   threads to hoarding threads.  It is unclear if this is a practical
   problem.

 * DEFERRED_FREE_BATCH_SIZE governs the maximum number of objects that
   a thread using PTB or SMR may de-allocate before performing a scan
   to attempt to free them, making them available for re-allocation.
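
The high/low water mark behaviour described for OBJECT_CACHE_HWM can
be sketched as follows.  This is a single-threaded illustration only,
not the code in libstm: the types node_t and cache_t, the function
release_object and the two numeric values are hypothetical, the
'volume' of storage is assumed to be a count of objects, and a real
shared pool would need synchronisation between threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; the real settings are build-time macros. */
#define OBJECT_CACHE_HWM 8
#define OBJECT_CACHE_LWM 4

typedef struct node {
    struct node *next;
} node_t;

/* A LIFO free list together with a count of the objects on it. */
typedef struct {
    node_t *head;
    size_t  count;
} cache_t;

static void cache_push(cache_t *c, node_t *n)
{
    n->next = c->head;
    c->head = n;
    c->count++;
}

static node_t *cache_pop(cache_t *c)
{
    node_t *n = c->head;
    if (n != NULL) {
        c->head = n->next;
        c->count--;
    }
    return n;
}

/* Return an object to the calling thread's cache.  If that pushes
   the cache above OBJECT_CACHE_HWM, move objects to the shared pool
   until only OBJECT_CACHE_LWM remain locally. */
static void release_object(cache_t *local, cache_t *shared, node_t *n)
{
    assert(local != NULL && shared != NULL && n != NULL);
    cache_push(local, n);
    if (local->count > OBJECT_CACHE_HWM) {
        while (local->count > OBJECT_CACHE_LWM) {
            cache_push(shared, cache_pop(local));
        }
    }
}
```

Draining down to the low water mark, rather than to just below the
high one, means a thread that frees objects steadily hands them back
in batches instead of touching the shared pool on every release.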

These are the known problems / limitations / ghastly hacks that exist
in the code:

 * The MAX_THREADS setting should not be necessary.  It is currently
   used at compile time to statically allocate arrays of guards for
   PTB and hazard pointers for SMR.  In each case dynamic allocation
   is possible, and the loops bounded by MAX_THREADS can be removed
   with a more careful implementation.

 * The per-transaction descriptors currently hold MAX_LOCATIONS
   entries.  These should be managed dynamically to allow
   MAX_LOCATIONS to be set realistically large without wasting space
   on small transactions.

 * Simple sorting and searching loops are used throughout under the
   assumption that the number of locations updated is low.  This
   should be revisited with profiling.

 * The NO_INLINE macro is not implemented (and, more generally, it has
   been placed in 'plausible' locations rather than in locations
   chosen by profiling).

 * STMReadValue is full of spurious memory barriers.
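
To make the points about fixed-size descriptors and simple loops
concrete, the sketch below shows the general shape such code can take.
It is an illustration under stated assumptions, not the descriptor
used by stm.c: the types entry_t and descriptor_t, the functions
find_entry and record_write, the value of MAX_LOCATIONS and the choice
to keep entries sorted by address are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value; the real setting is chosen at build time. */
#define MAX_LOCATIONS 16

typedef struct {
    uintptr_t addr;     /* location the transaction accesses */
    uintptr_t new_val;  /* value to install if it commits    */
} entry_t;

/* Statically sized descriptor: every transaction pays for
   MAX_LOCATIONS entries however few it actually uses. */
typedef struct {
    int     n;
    entry_t entries[MAX_LOCATIONS];
} descriptor_t;

/* Linear search for addr; returns its index, or -1 if absent. */
static int find_entry(const descriptor_t *d, uintptr_t addr)
{
    for (int i = 0; i < d->n; i++) {
        if (d->entries[i].addr == addr) {
            return i;
        }
    }
    return -1;
}

/* Record a write, keeping entries sorted by address with one step
   of insertion sort.  Returns 0 on success, -1 if the descriptor
   is already full. */
static int record_write(descriptor_t *d, uintptr_t addr, uintptr_t val)
{
    assert(d != NULL);
    int i = find_entry(d, addr);
    if (i >= 0) {
        d->entries[i].new_val = val;   /* overwrite earlier write */
        return 0;
    }
    if (d->n == MAX_LOCATIONS) {
        return -1;
    }
    i = d->n;
    while (i > 0 && d->entries[i - 1].addr > addr) {
        d->entries[i] = d->entries[i - 1];
        i--;
    }
    d->entries[i].addr = addr;
    d->entries[i].new_val = val;
    d->n++;
    return 0;
}
```

Both loops are linear in the number of entries, which is acceptable
for a handful of locations but would dominate the cost of a large
transaction.  That is why profiling should decide whether a hash
table or a dynamically grown array is worth the extra complexity.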

Tim Harris
11 Feb 2003

