Course pages 2016–17
Data Centric Systems and Networking
Principal lecturer: Dr Eiko Yoneki
Taken by: MPhil ACS, Part III
Hours: 16 (eight 2-hour seminar sessions combining lectures and reading club)
Class limit: 17 students
Prerequisites: Undergraduate network architectures and operating systems courses
Data Centric Systems and Networking
Leader: Dr Eiko Yoneki
Prerequisites: Undergraduate operating systems courses (suggested)
Structure: Eight 2-hour seminar sessions (combination of lectures and reading club)
Class limit: 10 students
This module introduces data centric systems, in which data is a token in the programming flow, and examines their impact on computer system architecture. Large-scale distributed applications that process big data will continue to grow in importance, so supporting the design and implementation of robust, secure, and heterogeneous large-scale distributed systems is essential. Because such systems have a large parameter space, tuning and optimising them is becoming an important and complex task. The course also explores integrating machine learning approaches into system optimisation.
This course provides various perspectives on data centric systems, including data-flow programming, stream processing, large-scale graph data processing, and computer systems optimisation, especially through machine learning approaches. It thus provides a solid basis for work on the next generation of distributed systems.
The module consists of 8 sessions, with 5 sessions on specific aspects of data-centric systems and networking research. Each of these sessions discusses 3-4 papers, led by the assigned students. One session is a hands-on tutorial on MapReduce using data flow programming and on Deep Neural Networks using Google TensorFlow with Amazon EC2. The first session advises on how to read and review a paper, and gives a brief introduction to different perspectives in data-centric systems. The last session is dedicated to the students' presentations of their open-source project studies. One guest lecture is planned, covering current research on stream processing systems.
- Introduction to data centric systems and networking
- Programming in a data centric environment
- Large-scale graph data processing: storage, processing model and parallel processing
- Data Flow Programming and Deep Neural Networks using TensorFlow: Hands-on Tutorial with Amazon EC2
- Stream Data Processing and Data/Query Model
- Optimisation in data processing and Auto-tuning
- Machine Learning for Computer Systems Optimisation
- Presentation of Open Source Project study
On completion of this module, students should:
- Understand key concepts of data centric approaches in future systems.
- Obtain a clear understanding of building distributed systems using data centric programming and large-scale data processing.
- Understand the use of machine learning in computer systems optimisation.
- The reading club will involve 2-4 papers every week. At each session, around 3 papers are selected under the given topic, and the students present their review work.
- Hands-on tutorial session on MapReduce parallel computing using data flow programming, including writing an application that processes streaming Twitter data, and on Deep Neural Networks using Google TensorFlow with Amazon EC2.
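As a rough illustration of the MapReduce style named in the tutorial item above, the following minimal sketch counts hashtags in a handful of tweets using plain Python. It is not the tutorial material: the function names, the sample tweets, and the in-memory "shuffle" step are all assumptions made for illustration, and a real deployment would run the map and reduce phases in parallel across machines and read from a live stream.

```python
from collections import defaultdict
from itertools import chain


def map_phase(tweet):
    """Emit a (hashtag, 1) pair for every hashtag in one tweet."""
    return [
        (word.lower(), 1)
        for word in tweet.split()
        if word.startswith("#") and len(word) > 1
    ]


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce runtime does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one hashtag."""
    return key, sum(values)


def count_hashtags(tweets):
    """Run map, shuffle and reduce sequentially over an iterable of tweets."""
    mapped = chain.from_iterable(map_phase(t) for t in tweets)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data standing in for a Twitter stream.
sample_tweets = [
    "Reading about #dataflow and #MapReduce",
    "#mapreduce on EC2 today",
    "no tags here",
]
hashtag_counts = count_hashtags(sample_tweets)
print(hashtag_counts)
```

Each phase is a pure function of its input, which is what lets a runtime distribute the map calls and the per-key reduce calls independently.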
The following three reports are required. Each may extend a reading club assignment or address a different topic within the scope of data centric systems and networking.
1. Review report on a full-length paper (max 1800 words)
  - Describe the contribution of the paper in depth with criticisms
  - Crystallise the significant novelty in contrast to other related work
  - Suggestions for future work
2. Survey report
  - Pick up to 5 papers as core papers in the survey scope
  - Read the above and expand reading through related work
  - Comprehend the view and write your own survey paper
3. Open-source project study
  - What is the significance of the project in the research domain?
  - Compare with similar and succeeding projects
  - Demonstrate the project by exploring its prototype
Reports 1 and 2 should be handed in by the end of the 5th and 7th weeks of the course (in either order). Report 3 should be handed in by the end of the Lent Term.
The final grade for the course will be provided as a percentage, and the assessment will consist of two parts:
- 20%: for reading club (participation, presentation)
- 80%: for the three reports:
  - 20%: Intensive review report
  - 25%: Survey report
  - 35%: Project study
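The weights above sum to 100%, so a final percentage can be sketched as a weighted average of the four components. The snippet below is only an illustration: it assumes each component is marked on a 0-100 scale and combined linearly, which the course page does not state, and the example marks are hypothetical.

```python
# Weights taken from the assessment breakdown (percent of the final grade).
WEIGHTS = {
    "reading_club": 20,
    "review_report": 20,
    "survey_report": 25,
    "project_study": 35,
}


def final_grade(marks):
    """Combine component marks (assumed to be on a 0-100 scale) into a percentage."""
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS) / 100


# Hypothetical marks, for illustration only.
example_marks = {
    "reading_club": 70,
    "review_report": 65,
    "survey_report": 60,
    "project_study": 72,
}
print(final_grade(example_marks))  # 67.2
```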
Abadi, M. et al.: TensorFlow: A System for Large-Scale Machine Learning, OSDI, 2016.
Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N. & Czajkowski, G.: Pregel: A System for Large-Scale Graph Processing, SIGMOD, 2010.
Ansel, J. et al.: OpenTuner: An Extensible Framework for Program Autotuning, PACT, 2014.
Hong, S., Chafi, H., Sedlar, E., Olukotun, K.: Green-Marl: A DSL for Easy and Efficient Graph Analysis, ASPLOS, 2012.
E. Zeitler and T. Risch: Massive Scale-out of Expensive Continuous Queries, VLDB, 2011.
A. Kyrola and G. Blelloch: GraphChi: Large-Scale Graph Computation on Just a PC, OSDI, 2012.
D. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, M. Abadi: Naiad: A Timely Dataflow System, SOSP, 2013.
J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin: PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, OSDI, 2012.
B. Gedik, H. Andrade, K. Wu, P. Yu, and M. Doo: SPADE: The System S Declarative Stream Processing Engine, SIGMOD, 2008.
M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, L. P. Chew: Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs, SPAA, 2008.
A complete list can be found on the course material web page. See also the 2016-2017 course material web page: http://www.cl.cam.ac.uk/~ey204/teaching/ACS/R212_2016_2017.