MPhil course in Multicore Programming - Exercise Sheet 1 Version of Nov 29 2013 Due: Monday 27 January 2013. Please hand to student admin, with a coversheet. 1. Read the chapters of the Intel SDM and AMD APM about memory ordering: Intel Vol. 3 Chapter 8: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html AMD Vol. 2 Chapter 7: http://developer.amd.com/Resources/documentation/guides/Pages/default.aspx Consider this x86 example: Initially all registers and [x] and [y] are 0. Thread 0 Thread 1 Thread 2 MOV [x] <- 1 MOV EAX <- [x] MOV [y] <- 1 MOV EBX <- [y] MOV ECX <- [x] Finally: Thread 1: EAX=1 Thread 1: EBX=0 Thread 2: ECX=0 a) Is this allowed with respect to an SC semantics? b) Can you argue whether this is allowed or not with respect to each of those chapters? c) Prove whether or not it is allowed with respect to the x86-TSO abstract machine. 2. Using ppcmem, http://www.cl.cam.ac.uk/~pes20/ppcmem, for each of the following tests, either explain why it doesn't have any non-SC behaviour or describe the abstract-machine trace of one non-SC behaviour (explain what's going on, but don't write out the full state after each step of the trace in excruciating detail). a) WRC+lwsync+isync b) IRIW+lwsync+lwsync c) MP+lwsync+addr d) SB+sync+lwsync 3. Give two different implementations (code implementing lock and unlock) for a simple spinlock in C11, using different atomic primitives (and not using the native C11 lock and unlock!). Explain why they are correct, being clear about what that means. You may want to use cppmem to explore their behaviour. 4. Suppose for (a) and (b) that we have an SC machine (compiler and hardware). a) Look at the naive stack implementation in stack-na.c and consider the concurrent client concurrent_read_client(). What can go wrong? Briefly describe two ways to fix it and their pros and cons. b) An alternative implementation uses a CAS, as in the basic Treiber stack in stack-sc.c. What can go wrong with this? c) Now consider that code running on Power or ARM hardware, but assume the compiler does no reordering. What barriers are needed and why? d) And now consider the C11 version in stack-weak.c. Is that sufficient? e) Is it unnecessarily strongly synchronised? f) Does this implementation provide an SC interface? In particular, can you construct analogues of the MP and IRIW litmus tests using push() and pop() instead of reads and writes, and, if so, how can they behave?