Understanding Networked-Systems Performance
Principal lecturer: Prof Andrew Moore
Taken by: MPhil ACS, Part III
Code: P56
Term: Lent
Hours: 16 (8 x 2hrs: first half lecture, second half lab)
Format: In-person lectures
Class limit: max. 12 students
Prerequisites: A student must possess knowledge comparable to the undergraduate subjects on Unix Tools, C/C++ programming, and Computer Networks. A student will also be advantaged by an introduction to computer systems modelling, such as the Part II Computer Systems Modelling subject, and by a background in measurement methodologies such as that provided by L50: Introduction to networking and systems measurements in the ACS/Part III curriculum.
Aims
This is a practical course in which students actively build and extend software tools to observe the detailed behaviour of transaction-oriented, datacenter-like software. Students will observe and understand sources of user-facing tail latency, including those stemming from resource contention, cross-program interference, poor software locking, and simple design errors.
A 2-hour weekly hybrid lecture/lab format gives students continuous, monitored progress, with the complexity of tasks building naturally upon the previous week's learning.
Objectives
Upon successful completion of the course, students will be able to:
- Make order-of-magnitude estimates of software, hardware, and I/O speeds
- Make valid measurements of actual software, hardware, and I/O speeds
- Create observation facilities, including logs and dashboards, as part of a software system design
- Create tracing facilities to fully observe the execution of complex software
- Time-align traces across multiple computers
- Display dense tracing information in meaningful ways
- Reason about the sources of real-time and transaction delays, including cross-program resource interference, remote procedure call delays, and software locking surprises
- Fix programs based on the above reasoning, making their response times faster and more robust
Syllabus
Measurement
- Week 1: Intro; measuring CPU time; rdtsc; measuring memory access times; measuring disks
- Week 2: gettimeofday; logs; measuring networks; Remote Procedure Calls; multi-threading; locks
Observation
- Week 3: RPC; logs; displaying traces; interference
- Week 4: Antagonist programs; logging; dashboards; profiling
KUtrace
- Week 5: Kernel patches; hello world; post-processing
- Week 6: KUtrace multi-CPU time display
- Week 7: Client-server KUtrace, with antagonists and interference
- Week 8: Other trace mysteries
Assessment
This is a lab-based module; assessment is via reports covering the guided laboratory work performed each week. There will be two submissions:
- 20% assignment deadline week 4 based upon Lab work from Weeks 1-4; word target 1000, limit 2000
- 80% assignment deadline week 8 based upon lab work from weeks 5-8; word target 2000, limit 4000
The intent of the first assessment point is to provide rich feedback to students, permitting focussed and improved work to be executed for the final assignment.
All work in this module is expected to be the effort of the student. Enough equipment is provisioned to allow each student an independent set of apparatus. While class members may find sharing operational experience valuable, all assessment is based upon a student's sole submission describing their own experiments and findings.
Reading Material
Core text
- Richard L. Sites, Understanding Software Dynamics, Addison-Wesley Professional Computing Series, 2022
Further reading: papers and presentations that you might find interesting. None is required reading; all are well-written and/or informative.
- John K. Ousterhout et al., "A Trace-Driven Analysis of the UNIX 4.2 BSD File System", ACM SIGOPS Operating Systems Review, Vol. 19, No. 5, Dec. 1985. https://dl.acm.org/doi/pdf/10.1145/323627.323631
- Luiz André Barroso, Jimmy Clidaras, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition
- John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, 5th Edition
- George Varghese, Network Algorithmics, Morgan Kaufmann
- Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy (actually keeping datacenter software up and running). Free PDF: https://landing.google.com/sre/book.html
- How NOT to run datacenters (27-minute video, funny/sad): dotScale 2014, Robert Kennedy, "Life in the Trenches of healthcare.gov". https://www.youtube.com/watch?v=GLQyj-kBRdo
An extended (optional) reading list will also be provided.