CONCURRENT AND DISTRIBUTED SYSTEMS LECTURE 8 TRANSCRIPT (C) 2023 DJ Greaves, University of Cambridge, Computer Laboratory. Lock-free Programming, Transactional Memory and Summary. Transactional memory was not lectured in October 2023. Time did not permit.

Well, what's wrong with locks? They're lovely, aren't they? No, they're difficult to get right, especially when the locks are fine grained. And they don't scale well. If we make the locks too coarse grained we deny too much parallelism via mutual exclusion. And they don't compose well. They can lead to deadlock. And they can lead to poor cache behaviour, with cache lines pinging backwards and forwards. And they can make threads become correlated with each other in terms of how they access shared locks or I/O devices. When a mutex is wanted but held, the implementation will yield to the scheduler after a few "fast path" attempts at spinning. Yielding invokes the scheduler, which is additional code being executed and hence degrades system performance overall. It's possible to get a convoy, a positive feedback effect, where once a thread encounters a lock that is held it becomes more likely that other attempts for other locks find them held too. We've also seen the problem of priority inversion. And locking has overheads, of course: even if a lock is free, extra code has to be run to take and release it.

So this brings us to the subject of lock-free programming. Of course, we don't want to abandon safety. We're going to need to program very carefully, but hopefully we can localise that careful programming effort: do it once and use it many times. For instance, we can have carefully crafted implementations of components that are commonly used by multiple threads, like queues and buffers. But these implementations won't use any locks inside them, or will use far fewer than a more naive approach would require. So let's see if we can use our atomic instructions, such as test-and-set, compare-and-swap and load-linked/store-conditional, directly on data structures rather than just using them to implement locks.

We're assuming we have a shared-memory, cache-consistent system. That means something written by one processor is seen by another: that's what cache consistency gives us. We'll also sometimes need to make some assumptions about sequential consistency, which concerns the order in which writes generated by one core are seen to happen by others. Our low-level operations will just include the everyday load instruction, which is atomic for some word size. And we'll have a corresponding write operation, which translates directly into a store instruction of the underlying machine, where a word of the native word width is atomically updated to a new value. But most importantly, we're going to have an atomic instruction. We're going to consider here the compare-and-swap primitive, which compares in one atomic step the value of a memory location with the value of an argument to the compare-and-swap instruction and, if they match, replaces that memory location with a new value, also specified as an argument to the instruction. The primitive returns a Boolean value telling us whether it successfully matched and stored the new value. And on more RISC-like architectures, of course, as discussed earlier, compare-exchange or compare-and-swap is replaced with a short piece of machine code which uses both the load-linked and the store-conditional instructions.
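As a concrete illustration of the semantics just described (my own sketch, not from the slides), the compare-and-swap primitive can be expressed with the standard C11 atomics library; the wrapper name cas_word is hypothetical.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Behaves as one indivisible step: store new_val into *addr only if *addr
       still holds expected, and report whether the store happened. */
    static bool cas_word(_Atomic uintptr_t *addr, uintptr_t expected, uintptr_t new_val)
    {
        /* atomic_compare_exchange_strong rewrites 'expected' with the value it
           found on failure; this sketch only uses the Boolean result. */
        return atomic_compare_exchange_strong(addr, &expected, new_val);
    }

On load-linked/store-conditional machines, the same effect is obtained from a small retry loop of LL and SC instructions, as noted above.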
I'm only giving you a little taster in this lecture and we're just going to look at one example, which is a linked list that can be used to maintain a set or a key/value store. It will use the compare-and-swap instruction to update the pointers. And we'll need three operations on the structure: 1 find, to see if a member with a given key exists; 2 insert, which will insert a new member in the linked list; and 3 delete, which will take an item out of the list. The insert and delete operations may fail at the first attempt, so they'll have to have a return value, and then the user code can try them again until they do work. We'll assume that the hardware supports atomic operations on the pointer size of the machine. Pointer size on most processors today is 64 bits, whereas the native machine word is 32 bits, so we're assuming full atomic operations on the store of essentially a double-word pointer, but that is the norm, especially for aligned data, on today's architectures. And we'll also assume full sequential consistency, which makes things nice and simple for the programmer. This is the assumption that if we store things in a particular order, other processors will read them back and see those changes happening in the same order. Processor families do actually vary in the level of sequential consistency that they support. For those who are interested, you might find that code that was written for Intel and is now run on Arm may occasionally come a cropper, and in those situations you will have to manually insert fence instructions to reduce the extent to which the hardware reorders the effects of reads and writes. [See Figure 4.17 of Modern SoC Design on Arm, "Two code fragments using message-passing in shared memory with explicit memory fences..."]

L8P21: Concurrent Find And Insert

Here's an example. Let's consider a singly-linked list which is maintained as a lock-free data structure. So here's our initial list: we have a head pointer that connects to three items on the list, a 10, a 30 and a 44. An invariant: the items are held in ascending order. Supposing we want to check whether item 20 is in the list, then we start at the head and look at the first item. 10 is not 20, but it's less than 20, so we'll continue on. We look at the next item, 30. That's not 20, but it's greater than 20. So given the list is in ascending order, item 20 is not in the list. Now, supposing we want to add a new item to the list. Let's suppose we want to insert 20. Well, perhaps we've scanned down and we've checked that item 20 is not in the list, so now we're going to insert it. Of course, doing those two steps separately doesn't sound very thread-safe! And indeed it won't be. Reading by itself is of course no problem; it's when we're also writing that we're going to have thread-safety problems. Let's proceed as follows. We create the new item we want to insert, value 20, on the heap somewhere, scan down to find the insert point, and make its tail pointer refer to the natural successor in the existing list, which is item 30. And now we naively think all we need to do is an atomic store to change the tail pointer of the previous item to point to this new item. Stores of pointers are atomic on all major architectures.
This atomic store should be intrinsically thread-safe, we think: previous to that store we have not changed anything in terms of what others see, and after that store we have changed exactly one thing, atomically, and we have the new data structure. What could go wrong? Well, sadly, there might be two writers. Suppose another thread comes along and it wants to store 25, hence an item with key 25. This is going to need to go in the same place. It too creates its new instance on the heap and spots that it needs to go between the existing 10 and 30. So it too sets its new instance's tail pointer to refer to the natural successor, 30. Finally, it too makes its atomic store on the successor pointer of the 10. Do we have a write-after-write failure? Yes, this is not going to work. Only one thread, the one which does the later "atomic" store of this tail pointer, is going to successfully insert its item in the data structure. So how could we mitigate this problem? Well, what we could do is a compare-and-swap. As before, we first create the new item on the heap and set its tail pointer to the natural successor. But before we make the actual insert store (setting the tail pointer of its predecessor), we can use CAS to atomically check that the previous value has not been touched. So instead of just doing a simple store here, we use a compare-and-swap. So of the two contending threads mentioned just now, the first one to make its compare-and-swap will find that the existing value of the tail pointer still points to the natural successor, and so it will succeed. Whereas the second one that comes along will fail, because the atomic compare-and-swap has already updated the relevant pointer. It can then naively start again in an automatic retry, and that should work: both situations are fine. The loser just needs to retry, and ultimately a compare-and-swap will be used to insert the new item in the list. Or perhaps a better implementation will avoid completely rescanning from the head --- you can work out the details yourself (a sketch follows below).

The so-called ABA problem can arise shortly after an item has been deleted from our lock-free list. If the deleted item has been placed on a free list under an arena-style (node pool) store manager, it could, in rare cases, be reinserted in the list at the same point but with a different key value, e.g. 27 instead of 25. A racing thread that is inserting 26 can end up putting it after the 27, because that same heap memory item held 25 at the time it was checked! The item was changed from state A (being on the list at that point) to state B (deallocated) and back to state A again, but it now represents something different. ABA arises when an observer does not notice that something was changed and then changed back again, but really did need to know about this. Dr Harris, lecturing the second part of this course, is an expert on these topics, but as far as I know, one solution to this problem is to 'quarantine' store that is being recycled: the arena manager needs to delay handing out something that has just been freed. This is highly likely to work, but can we prove correctness of such a scheme? Ditto 'pointer tagging', where extra bits that are ignored by the ISA (e.g. the top 16 bits on most A64 machines today, or the bottom 4 bits in 16-byte-aligned nodes) are filled with a fresh nonce on each allocation.
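To make the structure concrete, here is a minimal sketch (my own, not from the slides) of a node type and the find operation for the sorted singly-linked list, using C11 atomics; the names node, list_head and list_find are illustrative only.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct node {
        int key;
        struct node *_Atomic next;     /* tail pointer, always updated atomically */
    };

    /* Head pointer of the list; items are kept in ascending key order. */
    static struct node *_Atomic list_head;

    /* Return true if an item with the given key is present. A pure reader
       needs no CAS: it only performs atomic loads as it walks down the list. */
    static bool list_find(int key)
    {
        struct node *cur = atomic_load(&list_head);
        while (cur != NULL && cur->key < key)
            cur = atomic_load(&cur->next);
        return cur != NULL && cur->key == key;
    }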
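And here, under the same assumptions, is a sketch of the CAS-based insert with its automatic retry. It rescans from the head on each attempt and deliberately ignores the ABA and memory-reclamation issues discussed above, so it is illustrative rather than production quality.

    #include <stdlib.h>

    /* Insert a new key, preserving the ascending-order invariant.
       Returns false if the key is already present. Error handling omitted. */
    static bool list_insert(int key)
    {
        struct node *item = malloc(sizeof *item);
        item->key = key;
        for (;;) {
            /* Find the link to update: the pointer field of the last node
               whose key is less than the new key (or the head pointer). */
            struct node *_Atomic *link = &list_head;
            struct node *succ = atomic_load(link);
            while (succ != NULL && succ->key < key) {
                link = &succ->next;
                succ = atomic_load(link);
            }
            if (succ != NULL && succ->key == key) {
                free(item);
                return false;                   /* already in the set */
            }
            atomic_store(&item->next, succ);    /* point at the natural successor */
            /* Splice in atomically, but only if the predecessor's pointer is
               still what we saw during the scan; otherwise rescan and retry. */
            if (atomic_compare_exchange_strong(link, &succ, item))
                return true;
        }
    }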
Note that the ABA problem also arises in a lock-free implementation of the arena manager itself, but this can easily be mitigated by having a separate pool for each thread.

L8P22: Concurrent Find+Insert

There's also a possible race to consider between reading and writing the set. If one thread starts reading, checking for a key, while another happens to be inserting an item with exactly that key value, then what is the result going to be? Well, it doesn't really matter what the outcome is. That is a genuine race, and as long as the data structure maintains its integrity, then we are quite happy for our application program to be exposed to scheduling variations in that sort of scenario. If it's really important, then we will have added higher-level locking mechanisms of some nature or another, of the form we've already discussed. We also need to consider the delete primitive, but I'll leave that as an exercise for you. Obviously the main hazard is that as we delete an item, another thread has just tacked something onto the end of it, and that gets deleted as well, accidentally.

L8P23: Linearisability

I spoke about the potential race that arises when we have, say, a concurrent read alongside an update to a shared lock-free structure. What is the correct semantic, in fact? Well, we want the system to behave as though these concurrent operations on the lock-free structure happened in some nominal serial order. But we've already used the word serialisable for the existence of a nominal ordering of concurrent actions in a (possibly non-strict) transaction processing system, so we'll use the word linearisable here. [More on this term in the second half.] If the observed effect of concurrent operations on a shared lock-free data structure is the same as some putative serial, non-interleaved sequence of the same thread activity on that item, then we say the result is a linearisable schedule. For the concurrent read and write that we spoke of on our linked list just now, it's clear there are two outcomes: either the read sees the effect of the write or it doesn't. But of course, with more complicated data structures the outcome might not be so well defined. We've looked previously, for instance, at moving money between bank accounts and the conservation-of-money law being apparently broken, according to how the money movement and the total tallying interleave with each other and hence give different answers; exactly the same issues arise with these lock-free structures. Lock-free data structures are always designed and coded to be completely tolerant to any possible interleaving of the threads that are operating on them. Or, in the case where the designers really can't manage that, you have to resort to putting a lock in. But of course that is a little bit of a failure in some ways. And on the sequential consistency aspect, as I've said already, there's not only the interleaving of the different threads with each other; there's also the possible further interleaving and out-of-ordering of the effects of each thread in terms of how they become apparent to the others. It's worth pointing out that architectures vary in the side effects of some of the primitive operators. For instance, the compare-and-swap operation may, on most architectures, be associated with an implicit fence that ensures reads and writes issued before it are seen to happen before it, and those issued after it are seen to happen after it.
Whereas on the RISC-like processors, the load-linked/store-conditional instructions might have no associated intrinsic fence behaviour, and you have to put the fence instructions in explicitly yourself. But again, these fence issues will be for consideration in other courses on computer architecture and so on, not for this course.

L8P24 (S/W) Transactional Memory (TM) NOT LECTURED IN OCT 2023

Let's look at one more concurrency paradigm to round off this half of the course. This is the transactional memory concept. It can be implemented in hardware or software, or some combination of the two; from the user's point of view, it's the same paradigm regardless of how it's implemented. And it's based on the optimistic concurrency principles that we've seen before already. So here's the first example. We're going to operate on an array called shared X. We're going to read and write it at address I, or subscript I, and we're also going to read it at subscript J. We don't necessarily know the values of I and J in advance, and we would like this to operate, if possible, concurrently with other threads that are operating on the same array but with different values of I and J. Obviously, if we knew somehow that other threads were using completely different values for I and J (disjoint values), then we should be fine; we wouldn't need to do anything special. But we're going to assume that by and large they'll be different, yet sometimes they may overlap, in which case we're going to have to worry about serialisation and racing. Our first approach to coding this up is just to use a mutex. We put a lock and an unlock around our operation, and of course, even if other threads have totally disjoint I and J, they are not going to be able to participate. We have mutual exclusion and we will be denying concurrency when perhaps we didn't need to. If we have transactional memory support in our high-level language, then there will be some keyword such as "atomic" which enables us to delimit blocks that we wish to execute atomically. To give these blocks their atomic semantics, we're going to use optimistic concurrency control under the bonnet. Optimistic concurrency control is optimistic that it's going to be successful, but sometimes it will fail, and therefore we need to iterate with our DO/WHILE loop as shown here, so that we keep trying until we're successful. And there's quite a lot of detail in any real implementation. What's shown here is that we start the block with a transaction begin call, where we pass in our thread identity and the variables that we're going to want to operate on. We might need to distinguish between read and write. In return we get some sort of handle or transaction ID that we store in a local variable. We then proceed with the body of our code, which here looks unmodified, but actually it is modified: the reads will operate on checkpointed versions of the data, and the writes will be made to temporary data structures that need to be written back when we come to commit. And actually, in reality, on modern processors we _are_ operating on local copies because of the cache structure: the read data will be held in cache lines, and the things that we've written will be held in cache lines as well (or a write buffer), and will be written out later on. So we can already see that one possible route to implementing this sort of thing might be to exploit the hardware that we already have.
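A minimal sketch of that retry loop in C follows; the interface names tx_begin, tx_read, tx_write and tx_commit are hypothetical illustrations of the begin/commit calls just described, not a real library.

    #include <stdbool.h>

    /* Hypothetical software-TM interface, for illustration only. */
    typedef struct tx tx_t;
    tx_t *tx_begin(int thread_id);                /* returns a handle for this attempt      */
    int   tx_read(tx_t *tx, int *addr);           /* read via a checkpointed copy            */
    void  tx_write(tx_t *tx, int *addr, int v);   /* buffer the write until commit           */
    bool  tx_commit(tx_t *tx);                    /* validate and write back; false => retry */

    extern int shared_x[];

    void update(int thread_id, int i, int j)
    {
        bool done;
        do {
            tx_t *tx = tx_begin(thread_id);
            /* Body of the atomic block: all accesses go through the handle. */
            int a = tx_read(tx, &shared_x[i]);
            int b = tx_read(tx, &shared_x[j]);
            tx_write(tx, &shared_x[i], a + b);
            done = tx_commit(tx);                 /* fails if a conflicting commit occurred */
        } while (!done);
    }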
The transaction COMMIT primitive is going to do the two things that we discussed before under optimistic concurrency control. It checks whether the data that we read has been more recently updated and is now stale, and it then checks that the values that we need to write out will not overwrite values that have been written out by the commit of another transaction. If all is well and there have been no clashes, then we will commit successfully. Otherwise we will just repeat all of the work of the transaction.

What are the advantages, then, of transactional memory? From the user's point of view, it's very simple: we just put ATOMIC around anything that we want to occur in isolation. Well, you could say that we could do something similar to that with monitors, but from one monitor you can't call a method of another monitor; you're likely to end up with deadlock very rapidly if you try to do something like that. Another advantage is that TM can be very fine-grained, and it can automatically take on an appropriate granularity. So when we're operating on a large data structure, only the parts of it that need to be changed, like a sub-tree of a tree or a region of an array, are going to get the copy-on-write and write-back treatment. And another very attractive feature is that we can have our atomic blocks nesting. There's a variety of implementation techniques, some involving more compile-time and some more runtime work, and the way that the blocks nest will depend on the control flow of the programme and can vary from one run to another. In our example here we have credit and debit operators. These both have their own atomic keyword and achieve an atomic operation where they read and then write back to a shared variable. But if we invoke these debit and credit primitives from inside a larger atomic block, such as the body of the transfer method here, the effective range of the atomicity is dynamically expanded out, so that both the debit and credit operations are under the same atomic envelope, and they both take place or neither takes place (see the code sketch after this slide's discussion).

L8P26 TM advantages

Transactional memory offers the following potential advantages. For a start, it can't deadlock. There aren't any locks, so we don't have to worry about a locking order. But as with all optimistic concurrency schemes, there is a chance of livelock, where some group of transactions repeatedly try to commit, but each depends on some data modified by another one, in which case there can be multiple attempts to commit before, by chance, one of them manages to get through. But apart from that, there shouldn't be any other races. You cannot forget to take out a lock, because there aren't any. But you could forget to put an atomic decoration around the critical section. There may be some high-level language support for that: you could look at the [inaudible] language for instance, which has some support for transactional memory; it's a variant of [inaudible]. In general, optimistic concurrency control gives us good scaling performance, and there's no need to worry about fine-grained locking: as we've seen, the atomicity can be automatically applied to small regions of a shared data structure. But even with this fine-grained support, there's still going to be a performance loss if we mark too much of our programme as atomic. This is going to reduce the amount of available concurrency, of course. And if we have too little marked as atomic, then we're making a programming error and we will see race condition problems arising.
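Returning to the credit, debit and transfer example, a minimal sketch of the nesting might look like the following. I am using GCC's experimental transactional memory extension (__transaction_atomic blocks and the transaction_safe attribute, compiled with -fgnu-tm) purely as one concrete notation for the slide's "atomic" keyword; the account type and function names are my own.

    /* Compile with: gcc -fgnu-tm file.c   (links against libitm) */
    struct account { long balance; };

    __attribute__((transaction_safe))
    void credit(struct account *a, long amount)
    {
        __transaction_atomic {          /* read-modify-write performed atomically */
            a->balance += amount;
        }
    }

    __attribute__((transaction_safe))
    void debit(struct account *a, long amount)
    {
        __transaction_atomic {
            a->balance -= amount;
        }
    }

    void transfer(struct account *from, struct account *to, long amount)
    {
        /* The inner atomic blocks of debit and credit are subsumed by this outer
           one, so either both updates commit or neither does. */
        __transaction_atomic {
            debit(from, amount);
            credit(to, amount);
        }
    }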
L8P27 TM is very promising

Overall, transactional memory still looks like a very attractive paradigm for concurrency and parallel computing, much in the same way as transactions have worked out very well for distributed computing. TM gives us atomicity, consistency and isolation, three of the things that we need from our transaction processing system, but it doesn't help with durability. However, crash recovery is greatly simplified, because we haven't dirtied any objects before we come to the commit stage; we've only operated on copies. TM essentially places a layer of processing between a program and the memory it operates on. This additional layer can be implemented at a number of different points in the system. It could be done by macro expansion of the high-level language, it could be in the HLL compiler, it could be in the virtual machine for something like Java or Python, or it could be in the hardware platform itself. In all these cases, the application program itself only needs very light modification, yet can benefit from fine-grained concurrency. If we look at the cache structures that we find in today's computers, then they already have many of the mechanisms that we need for a program to operate on a local copy of the data and for that local copy to be dynamically created on demand. The cache consistency protocols that we have can be augmented to keep track more closely of what's been read and what's been written but is not yet ready to be written back to where it should be, and that sort of thing. Both Intel and Arm have introduced processors with hardware transactional memory support. But there have been quite a few bugs, some of them security bugs and others functionality bugs, which have meant that these facilities have been offered, then withdrawn, and re-offered several times over. But TM is not a panacea. The contention management can get ugly if we have excessive thrashing and livelock situations. Our I/O libraries have to be modified so that we can essentially undo an external operation: if we're sending a message to something that's outside the TM system, then we need to keep that message in an output queue until we're sure that we've committed the relevant transaction. Overall, there have been quite a few attempts in the last decade to establish a good set of primitives for transactional processing, in both hardware and software implementations. No doubt Doctor X will say something about the distributed systems context. And I see that there are some recent talks on adding transactional memory extensions to the RISC-V instruction set, for instance, that you might care to look up.

L8P28 Concurrent systems: summary

Concurrency is certainly essential in modern systems. Earlier we presented this list of three main motivators. The old one of overlapping I/O devices with computation, to get a speed match and to decouple them. Then, in today's era of multicore computers, the fact that we can't use the multiple cores for a single program if we don't write that program in some sort of concurrent/parallel way. And much the same goes for distributed systems, where really the principal differences are the lack of shared memory and the error recovery procedures, in which we expect individual computers to go up and down while the rest of the distributed system remains functional. We've reviewed the main challenges raised by concurrent systems. We need to ensure safety against race conditions in our critical sections using mutual exclusion.
We've needed to provide synchronisation primitives so that one thread or task can trigger another one. We have to avoid losses of liveness, such as deadlock, of course, while doing that. The two major risks really are bugs and over-engineering. We can avoid bugs by running our programme on a single thread as far as possible to start with: build a sequential system first. Over-engineering, on the other hand, where we put in too many locks or deploy concurrency mechanisms too liberally, will reduce performance, and we can end up with a serialised system which doesn't exhibit parallel speed-up as we add further processors. And generally you should always use existing libraries, tools and design patterns rather than rolling your own, unless, that is, you've got a brilliant new idea on how to do something better, or you just want to re-implement an existing one to understand it more fully.

L8P29 Summary + Next Time

To summarise this lecture, we looked at the durability aspect of transaction processing systems: crash recovery, and using a write-ahead log and checkpoints to give that durability. We had a brief digression into lock-free programming, and then we looked at transactional memory, which doesn't give us that durability, but does meet the A, C and I. And next time we shall have Doctor X on distributed systems. Thank you for your attention. END