Course pages 2015–16
Data Centric Systems and Networking
Principal lecturer: Dr Eiko Yoneki
Taken by: MPhil ACS, Part III
Code: R212
Hours: 16 (eight 2-hour seminar sessions, combining lectures and a reading club)
Class limit: 17 students
Prerequisites: Undergraduate network architectures and operating systems courses
Aims
This module provides an introduction to data centric systems and networking, in which data is treated as a token in both the programming flow and the network, and to the impact of this approach on computer system architecture. Large-scale distributed applications with big data processing will continue to grow in importance and become a pervasive aspect of the lives of millions of users. Supporting the design and implementation of robust, secure, and heterogeneous large-scale distributed systems is essential.
Syllabus
This course offers various perspectives on data centric systems and networking, including content-based routing, data-flow programming, stream processing, and large-scale graph data processing, providing a solid basis for work on the next generation of distributed systems and communication paradigms.
The module consists of 8 sessions, 5 of which cover specific aspects of data-centric systems and networking research. Each session discusses 2-3 papers, led by the assigned students. One session is a hands-on tutorial on MapReduce using data-flow programming with Amazon EC2. The first session gives advice on how to read and review a paper, together with a brief introduction to the different perspectives in data-centric systems. The last session is dedicated to the students' presentations of their open-source project studies. One guest lecture is planned, covering current research on stream processing systems.
- Introduction to data centric systems and networking
- Programming in a data centric environment
- Large-scale graph data processing: Storage, processing model and parallel processing
- MapReduce hands-on tutorial using data-flow programming with Amazon EC2
- Scheduling irregular tasks: Optimisation in parallel computing environments
- Stream data processing and data/query model
- Data centric aspects in networking (Content centric model in Internet/data center)
- Presentation of Open Source Project Study
Objectives
On completion of this module, students should:
- Understand key concepts of data centric approaches in future networking and systems.
- Obtain a clear understanding of building distributed systems using data centric programming and large-scale data processing.
Coursework
Reading Club:
- The reading club covers 1-3 papers every week. At each session, around 2-3 papers are selected on the given topic, and the students present their reviews.
- Hands-on tutorial session on MapReduce parallel computing using data-flow programming with Amazon EC2, including writing an application that processes streaming Twitter data.
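As a rough illustration of the MapReduce style used in the hands-on tutorial, the sketch below counts hashtags in a handful of tweets with explicit map, shuffle, and reduce steps. It runs locally in plain Python and does not use Amazon EC2 or any MapReduce framework. The function names and the sample tweets are hypothetical and are not taken from the course material.

```python
from collections import defaultdict
from itertools import chain


def map_hashtags(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet."""
    for word in tweet.split():
        if word.startswith("#") and len(word) > 1:
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one hashtag."""
    return key, sum(values)


def count_hashtags(tweets):
    """Run map, shuffle, and reduce in sequence over an iterable of tweets."""
    pairs = chain.from_iterable(map_hashtags(t) for t in tweets)
    grouped = shuffle(pairs)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical sample input standing in for a Twitter stream.
sample_tweets = [
    "Reading the Pregel paper #graphs #bigdata",
    "Naiad is neat #dataflow #BigData",
    "no tags here",
]
hashtag_counts = count_hashtags(sample_tweets)
print(hashtag_counts)
```

In a real deployment the map and reduce functions would run on separate workers and the shuffle would move data between them. Here all three steps share one process so that the data flow is easy to follow.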
Reports:
The following three reports are required. Each may extend a reading club assignment or address a different topic within the scope of data centric systems and networking.
1. Review report on a full-length paper (max 1800 words)
  - Describe the contribution of the paper in depth, with criticisms
  - Crystallise the significant novelty in contrast to other related work
  - Suggest future work
2. Survey report on a sub-topic in data centric networking (max 2000 words)
  - Pick up to 5 papers as core papers in the survey scope
  - Read the above and expand the reading through related work
  - Form your own view and write your own survey paper
3. Project study and exploration of a prototype (max 2500 words)
  - What is the significance of the project in the research domain?
  - Compare with similar and succeeding projects
  - Demonstrate the project by exploring its prototype
Reports 1 and 2 should be handed in by the end of the 5th and 7th weeks of the course (in either order). Report 3 should be handed in by the end of the Michaelmas Term.
Assessment
The final grade for the course will be provided as a percentage, and the assessment will consist of two parts:
- 20%: for reading club (participation, presentation)
- 80%: for the three reports:
  - 20%: Intensive review report
  - 25%: Survey report
  - 35%: Project study
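The weighting above can be checked with a short calculation. The sketch below uses only the percentages listed in this section. The component names and the example marks are hypothetical, and the sketch assumes each component is marked out of 100.

```python
# Weights in percentage points, taken from the assessment breakdown above.
WEIGHTS = {
    "reading_club": 20,
    "review_report": 20,
    "survey_report": 25,
    "project_study": 35,
}


def final_grade(marks):
    """Return the weighted final percentage for marks given out of 100."""
    missing = set(WEIGHTS) - set(marks)
    if missing:
        raise ValueError(f"missing marks for: {sorted(missing)}")
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS) / 100


# Hypothetical marks, for illustration only.
example_marks = {
    "reading_club": 70,
    "review_report": 60,
    "survey_report": 80,
    "project_study": 70,
}
grade = final_grade(example_marks)
print(grade)  # 70.5
```

The three report weights add up to the 80% allocated to reports, and together with the reading club they total 100%.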
Recommended reading
[1] Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A System for Large-Scale Graph Processing, SIGMOD, 2010.
[2] Jacobson, V., Smetters, D.K., Thornton, J.D., Plass, M.F., Briggs, N.H., Braynard, R.L.: Networking Named Content, CoNEXT, 2009.
[3] Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U.A., Pasquini, R.: Incoop: MapReduce for Incremental Computations, ACM SOCC, 2011.
[4] Hong, S., Chafi, H., Sedlar, E., Olukotun, K.: Green-Marl: A DSL for Easy and Efficient Graph Analysis, ASPLOS, 2012.
[5] Zeitler, E., Risch, T.: Massive Scale-out of Expensive Continuous Queries, VLDB, 2011.
[6] Kyrola, A., Blelloch, G.: GraphChi: Large-Scale Graph Computation on Just a PC, OSDI, 2012.
[7] Murray, D., McSherry, F., Isaacs, R., Isard, M., Barham, P., Abadi, M.: Naiad: A Timely Dataflow System, SOSP, 2013.
[8] Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, OSDI, 2012.
A complete list can be found on the course material web page. See also the 2014-2015 course material web page: http://www.cl.cam.ac.uk/~ey204/teaching/ACS/R212_2014_2015.