Computer Laboratory

Course pages 2015–16

Concurrent and Distributed Systems

Principal lecturer: Dr Robert Watson
Taken by: Part IB
Past exam questions

No. of lectures: 16 (Continued in Lent Term)
Suggested hours of supervisions: 4
Prerequisite courses: Operating Systems, Object-Oriented Programming
This course is a pre-requisite for Mobile and Sensor Systems (Part II).

Aims of the Michaelmas Term part of the course

The aim of the course is to introduce concurrency control concepts and their implications for system design and implementation.

Michaelmas Term Lectures (Concurrency)

  • Introduction to concurrency, threads, and mutual exclusion Introduction to concurrent systems; threads; interleaving; preemption; parallelism; execution orderings; processes and threads; kernel vs. user threads; M:N threads; atomicity; mutual exclusion; and mutual exclusion locks (mutexes).

  • More mutual exclusion, semaphores, producer-consumer, and MRSW Hardware foundations for atomicity; locks and invariants; semaphores; condition synchronisation; N-resource allocation; two-party and generalised producer-consumer; Multi-Reader, Single-Write (MRSW) locks.

  • CCR, monitors, and concurrency in practice Conditional critical regions (CCR); monitors; condition variables; signal-wait vs. signal-continue semantics; concurrency in practice (kernels, pthreads, Java).

  • Safety and liveness Safety vs. liveness; deadlock; the Dining Philosophers; resource allocation graphs; deadlock prevention, avoidance, detection, and recovery; livelock; priority inversion; priority inheritance.

  • Concurrency without shared data; transactions Active objects; message passing; tuple spaces; CSP; and actor models. Composite operations; transactions; ACID; isolation; and serialisability.

  • Further transactions History graphs; good and bad schedules; isolation vs. strict isolation; 2-phase locking; rollback; timestamp ordering (TSO); and optimistic concurrency control (OCC).

  • Crash recovery, lock-free programming, and transactional memory Write-ahead logging, checkpoints, and recovery. Lock-free programming and software-transactional memory (STM).

  • Concurrent systems case study. Concurrency in the FreeBSD kernel; kernel synchronisation before parallelism; Giant-locked kerels; fine-grained locking; primitives and strategies; lock order checking; network-stack work flows; performance scalability; the impact of changing hardware.

Objectives

At the end of the course students should

  • understand the need for concurrency control in operating systems and applications, both mutual exclusion and condition synchronisation;

  • understand how multi-threading can be supported and the implications of different approaches;

  • be familiar with the support offered by various programming languages for concurrency control and be able to judge the scope, performance implications and possible applications of the various approaches;

  • be aware that dynamic resource allocation can lead to deadlock;

  • understand the concept of transaction; the properties of transactions, how they can be implemented, and how their performance can be optimised based on optimistic assumptions;

  • understand how the persistence properties of transactions are addressed through logging; and

  • have a high-level understanding of the evolution of software use of concurrency in the operating-system kernel case study.

Recommended reading

* Bacon, J. & Harris, T. (2003). Operating systems: distributed and concurrent software design. Addison-Wesley.
Bacon, J. (1997). Concurrent systems. Addison-Wesley.
Tanenbaum, A.S. & van Steen, M. (2002). Distributed systems. Prentice Hall.
Coulouris, G.F., Dollimore, J.B. & Kindberg, T. (2005, 2001). Distributed systems, concepts and design. Addison-Wesley (4th, 3rd eds.).

Aims of the Lent Term part of the course

The aims of this course are to study the fundamental characteristics of distributed systems, including their models and architectures; the implications for software design; some of the techniques that have been used to build them; and the resulting details of good distributed algorithms and applications.

Lent Term Lectures (Distributed Systems)

  • Introduction to distributed systems; RPC Avantages and challenges of distributed systems; “middleware”; transparency goals; client-server systems; failures and retry semantics (all-or-nothing; at-most-once; at-least-once). Remote procedure call (RPC); marshalling; interface definition languages (IDLs); SunRPC; external data representation (XDR).

  • Network File System and Object-Oriented Middleware Network File System (NFS); NFSv2; NFSv3; scoping; the implications of a stateless design; performance optimisations. Object-oriented middleware (OOM); Corba ORBs, IDL; DCOM.

  • Practical RPC systems; clocks Remote method invocation (RMI); remote classes vs. serialisable classes; distributed garbage collection; XML-RPC; SOAP and web services; REST. Physical clocks; UTC; computer clocks; clock synchronisation.

  • Clock synchronisation; logical clocks Clock drift and compensation; Cristian’s Algorithm; Berkeley Algorithm; Network Time Protocol (NTP). Logical time, “happens-before”; Lamport clocks; vector clocks.

  • Consistent cuts, process groups, and mutual exclusion Consistent global state; consistent cuts. Process groups; FIFO ordering; receiving vs. delivering; causal ordering; total ordering. Distributed mutual exclusion; central lock servers; token passing; totally ordered multicst.

  • Elections, consensus, and distributed transactions Leader elections; ring-based algorithm; the Bully algorithm. Consensus. Distributed transactions; atomic commit protocols; 2-phase commit. Replication and consistency.

  • Replication in distributed systems, CAP, case studies Replication and consistency (cont); strong consistency; quorum systems; weak consistency; FIFO consistency; eventual consistency; Amazone’s Dynamo; session guarantees; Consistency, Availability and Partitions (CAP); Google datacentre technologies (MapReduce).

  • Further case studies, PubSub, security, NASD/AFS/Coda Google datacentre technologies (BigTable, Spanner). Access control and the access-control matrix; ACLs vs capabilities; cryptographic capabilities; role-based access control (RBAC); single-system sign-on. NASD, AFS, and Coda.

Objectives

At the end of the course students should

  • understand the difference between simple concurrent systems and distributed systems;

  • understand the fundamental properties of distributed systems and their implications for system design;

  • understand notions of time synchronisation, including logical clocks, vector clocks, and physical time;

  • be familiar with various approaches to data and service replication, as well as the concept of data consistency;

  • understand the effects of large scale on the provision of fundamental services and the tradeoffs arising from scale;

  • appreciate the implications of individual node and network communications failures on distributed computation;

  • be aware of a variety of tools used by distributed-system creators, such as RPC and object-oriented middleware (OOM);

  • be familiar with a range of distributed algorithms;

  • be familiar with a number of case studies in distributed-system design including: the Network File System (NFS), the Network Time Protocol (NTP), Java Remote Method Invocation (RMI), CORBA, the AFS and Coda filesystems, Network-Attached Secure Disks (NASD), and Google’s MapReduce, BigTable, and Spanner systems.

Recommended reading

* Bacon, J. & Harris, T. (2003). Operating systems: distributed and concurrent software design. Addison-Wesley.
Bacon, J. (1997). Concurrent systems. Addison-Wesley.
Tanenbaum, A.S. & van Steen, M. (2002). Distributed systems. Prentice Hall.
Coulouris, G.F., Dollimore, J.B. & Kindberg, T. (2005, 2001). Distributed systems, concepts and design. Addison-Wesley (4th, 3rd eds.).