
Department of Computer Science and Technology

Course pages 2023–24

Understanding Networked-Systems Performance

Principal lecturer: Prof Andrew Moore
Taken by: MPhil ACS, Part III
Code: P56
Term: Lent
Hours: 16 (8 x 2-hour sessions; first half lecture, second half lab)
Format: In-person lectures
Class limit: max. 12 students
Prerequisites: A student must possess knowledge comparable to the undergraduate subjects on Unix Tools, C/C++ programming and Computer Networks. Optionally, a student will benefit from an introduction to computer systems modelling, such as the Part II Computer Systems Modelling subject, and from a background in measurement methodologies, such as that provided by L50: Introduction to networking and systems measurements in the ACS/Part III curriculum.

Aims

This is a practical course in which students actively build and extend software tools to observe the detailed behaviour of transaction-oriented, datacenter-like software. Students will observe and understand sources of user-facing tail latency, including that stemming from resource contention, cross-program interference, bad software locking, and simple design errors.
A weekly 2-hour hybrid lecture/lab format allows students' progress to be monitored continuously, with the complexity of tasks building naturally upon the previous week's learning.

Objectives

Upon successful completion of the course, students will be able to:

  • Make order-of-magnitude estimates of software, hardware, and I/O speeds
  • Make valid measurements of actual software, hardware, and I/O speeds
  • Create observation facilities, including logs and dashboards, as part of a software system design
  • Create tracing facilities to fully observe the execution of complex software
  • Time-align traces across multiple computers
  • Display dense tracing information in meaningful ways
  • Reason about the sources of real-time and transaction delays, including cross-program resource interference, remote procedure call delays, and software locking surprises
  • Fix programs based on the above reasoning, making their response times faster and more robust


Syllabus

Measurement
   Week 1 Intro, Measuring CPU time, rdtsc, Measuring memory access times, Measuring disks
   Week 2 gettimeofday, logs, Measuring networks, Remote Procedure Calls, Multi-threads, locks
Observation
   Week 3 RPC, logs, displaying traces, interference
   Week 4 Antagonist programs, Logging, dashboards, profiling
KUtrace
   Week 5 Kernel patches, hello world, post-processing
   Week 6 KUtrace multi-CPU time display
   Week 7 Client-server KUtrace, with antagonists and interference
   Week 8 Other trace mysteries


Assessment

This is a lab-based module; assessment is based upon reports covering guided laboratory work performed each week. Assessment will be via two submissions:

  • 20% assignment, deadline week 4, based upon lab work from weeks 1-4; word target 1000, limit 2000
  • 80% assignment, deadline week 8, based upon lab work from weeks 5-8; word target 2000, limit 4000

The intent of the first assessment point is to give students rich feedback on the 20% assignment, permitting more focussed and improved work on the final assignment.

All work in this module is expected to be the student's own effort. Enough equipment is provisioned to allow each student an independent set of apparatus. While class members may find sharing operational experience valuable, all assessment is based upon each student's sole submission, drawn from their own experiments and findings.

Reading Material

Core text

  • Richard L. Sites, Understanding Software Dynamics, Addison-Wesley Professional Computing Series, 2022

Further reading: papers and presentations that you might find interesting. None are required reading – they are well-written and/or informative.

  • John K. Ousterhout et al., A trace-driven analysis of the UNIX 4.2 BSD file system, ACM SIGOPS Operating Systems Review, Vol. 19, No. 5, Dec. 1985. https://dl.acm.org/doi/pdf/10.1145/323627.323631
  • Luiz André Barroso, Jimmy Clidaras, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition
  • John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, 5th Edition
  • George Varghese, Network Algorithmics. Morgan Kaufmann
  • Actually keeping datacenter software up and running, and how Google runs production systems: Site Reliability Engineering, edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy. Free PDF: https://landing.google.com/sre/book.html
  • How NOT to run datacenters (27-minute video, funny/sad): dotScale 2014, Robert Kennedy, Life in the Trenches of healthcare.gov. https://www.youtube.com/watch?v=GLQyj-kBRdo

An extended (optional) reading list will also be provided.