Understanding Networked-Systems Performance
Principal lecturer: Prof Andrew Moore
Taken by: MPhil ACS, Part III
Code: P56
Term: Lent
Hours: 16 (8 x 2hrs: first half lecture, second half lab)
Format: In-person lectures
Class limit: max. 12 students
Prerequisites: A student must possess knowledge comparable to the undergraduate subjects on Unix Tools, C/C++ programming, and Computer Networks. A student will also be advantaged by an introduction to computer systems modelling, such as the Part II Computer Systems Modelling subject, and by a background in measurement methodologies such as that provided by L50: Introduction to networking and systems measurements in the ACS/Part III curriculum.
Aims
This is a practical course in which students actively build and extend software tools to observe the detailed behaviour of transaction-oriented, datacenter-like software. Students will observe and understand sources of user-facing tail latency, including those stemming from resource contention, cross-program interference, poor software locking, and simple design errors.
A 2-hour weekly hybrid lecture/lab format gives students continuous, monitored progress, with the complexity of tasks building naturally upon the previous week's learning.
Objectives
Upon successful completion of the course, students will be able to:
- Make order-of-magnitude estimates of software, hardware, and I/O speeds
- Make valid measurements of actual software, hardware, and I/O speeds
- Create observation facilities, including logs and dashboards, as part of a software system design
- Create tracing facilities to fully observe the execution of complex software
- Time-align traces across multiple computers
- Display dense tracing information in meaningful ways
- Reason about the sources of real-time and transaction delays, including cross-program resource interference, remote procedure call delays, and software locking surprises
- Fix programs based on the above reasoning, making their response times faster and more robust
Syllabus
Measurement
- Week 1: Intro; measuring CPU time; rdtsc; measuring memory access times; measuring disks
- Week 2: gettimeofday; logs; measuring networks; Remote Procedure Calls; multi-threading; locks
Observation
- Week 3: RPC; logs; displaying traces; interference
- Week 4: Antagonist programs; logging; dashboards; profiling
KUtrace
- Week 5: Kernel patches; hello world; post-processing
- Week 6: KUtrace multi-CPU time display
- Week 7: Client-server KUtrace, with antagonists and interference
- Week 8: Other trace mysteries
Assessment
This is a lab-based module; assessment is via reports covering the guided laboratory work performed each week. There will be two submissions:
- 20% assignment deadline week 4 based upon Lab work from Weeks 1-4; word target 1000, limit 2000
- 80% assignment deadline week 8 based upon lab work from weeks 5-8; word target 2000, limit 4000
The intent of the first assessment point is to provide rich feedback to students, permitting focussed and improved work to be executed for the final assignment.
All work in this module is expected to be the effort of the student. Enough equipment is provisioned to allow each student an independent set of apparatus. While class members may find sharing operational experience valuable, all assessment is based upon a student's sole submission describing their own experiments and findings.
Reading Material
Core text
- Richard L. Sites, Understanding Software Dynamics, Addison-Wesley Professional Computing Series, 2022
Further reading: papers and presentations that you might find interesting. None is required reading; all are well-written and/or informative.
- John K. Ousterhout et al., "A Trace-Driven Analysis of the UNIX 4.2 BSD File System", ACM SIGOPS Operating Systems Review, Vol. 19, No. 5, Dec. 1985. https://dl.acm.org/doi/pdf/10.1145/323627.323631
- Luiz André Barroso, Jimmy Clidaras, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition
- John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, 5th Edition
- George Varghese, Network Algorithmics, Morgan Kaufmann
- Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy (actually keeping datacenter software up and running). Free PDF: https://landing.google.com/sre/book.html
- How NOT to run datacenters (27-minute video, funny/sad): dotScale 2014, Robert Kennedy, "Life in the Trenches of healthcare.gov". https://www.youtube.com/watch?v=GLQyj-kBRdo
An extended (optional) reading list will also be provided.