Data Centric Systems and Networking (2014-2015 Michaelmas Term)

Overview

This module provides an introduction to data centric systems, where data is a token in the programming flow and networking, and its impact on the computer system's architecture. Large-scale distributed applications with big data processing will grow ever more in importance and become a pervasive aspect of the lives of millions of users. Supporting the design and implementation of robust, secure, and heterogeneous large-scale distributed systems is essential. This course provides various perspectives on data centric systems and networking, including data-flow programming, stream processing, content-based routing and large-scale graph data processing, thus providing a solid basis to work on the next generation of distributed systems and communication paradigms. This course provides various perspectives on data centric systems and networking, including data-flow programming, stream processing, content-based routing and large-scale graph data processing, thus providing a solid basis to work on the next generation of distributed systems and communication paradigms. On completion of this module, the students should:

Understand key concepts of data centric approaches in future computer systems.
Obtain a clear understanding of building distributed systems using data-centric programming and large-scale data processing.

Module Structure

The module consists of 8 sessions, with 5 sessions on specific aspects of data-centric systems and networking research. Each session discusses 2-3 papers, led by the assigned students. One session is a hands-on tutorial on MapReduce using data flow programming with Amazon EC2. The 1st session advises on how to read/review a paper together with a brief introduction of different perspectives in data-centric systems. The last session is dedicated to the presentation of the open-source project studies presented by the students. One guest lecture is planned, covering inspiring current research on stream processing systems.

Schedule and Reading List

We’ll meet in GS15 (FS07) every Monday (from October 13 to December 1) in 2014. The time slot is 15:00-17:00.

2014/10/13 Session 1: Introduction to Data Centric Systems and Networking

Introduction to Data Centric Systems and Networking (slides)
- Assignment details
- Guidance of how to read/review/present a paper
- Guidance to Open Source Project
Technologies for Big Data Processing (slides)

2014/10/20 Session 2: Programming in Data Centric Environment

Data flow programming, Cluster Computing

1. Yuan Yu, Michael Isard, D. Fetterly, M. Budiu, U. Erlingsson, P.K. Gunda, J. Currey: DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, OSDI, 2008.

Varun Gandhi (slides)
2. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, I. Stoica: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI, 2013.

3. Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, Russell Sears: Boom analytics: exploring data-centric, declarative programming for the cloud, Eurosys 2010.

Mariana-Cristina Marasoiu (slides)
4. J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004.

5. Derek Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy and Steven Hand: Ciel: a universal execution engine for distributed data-flow computing, NSDI 2011.

Frank McSherry's Talk on Differential Dataflow is here.

6.1. Frank McSherry, Rebecca Isaacs, Michael Isard, and Derek G. Murray, Composable Incremental and Iterative Data-Parallel Computation with Naiad, no. MSR-TR-2012-105, 2012.

Karthik Nilakant (slides)
6.2. D. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, M. Abadi: Naiad: A Timely Dataflow System, SOSP, 2013.

Neil Satra (slides)
7. P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquini: Incoop: MapReduce for incremental computation, ACM SOCC, 2011.

8. Dionysios Logothetis, Christopher Olston, Benjamin Reed, Kevin Webb and Kenneth Yocum: Stateful Bulk Processing for Incremental Analytics, SOCC, 2010.

2014/10/27 Session 3: Processing Models of Large-Scale Graph Data

Scalable distributed processing of graph structured data, processing model, and programming model
Apatch Giraph demo

William Jones (slides)
1. G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski: Pregel: A System for Large-Scale Graph Processing, SIGMOD, 2010.

2. U. Kang, C. E. Tsourakakis, C. Faloutsos: PEGASUS: A peta-scale graph mining system - Implementation and observations, ICDM , 2009.

Philip Leonard (slides)
3. Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, Z.Zhang: MadLINQ: large-scale distributed matrix computation for the cloud, EuroSys, 2012.

Maciej Biskupiak (slides)
4. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. Hellerstein: Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB, 2012.

David Reti (slides)
5. .J. Shun and G. Blelloch: Ligra: A Lightweight Graph Processing Framework for Shared Memory, PPoPP, 2013.

6. J. Kim, W. Han, S. Lee, K. Park, H. Yu: OPT: A New Framework for Overlapped and Parallel Triangulation in Large-scale Graphs, SIGMOD, 2014.

7. J. Gonzalez, R. Xin, A. Dave, D. Crankshaw, M. Franklin, I. Stoica: GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI, 2014.

8. M. Han, K. Daudjee, K. Ammar, M. Ozsu, X. Wang, T. Jin: An Experimental Comparison of Pregel like Graph Processing Systems, SIGMOD, 2014.

Micahel Schaarschmidt (slides)
9. B. Shao, H. Wang, Y. Li: Trinity: A Distributed Graph Engine on a Memory Cloud, SIGMOD, 2013.

2014/11/03 Session 4: Data Flow Programming Handson Tutorial with Amazon EC2

This tutorial session will be in SW02.
Data Flow Programming Tutorial: Hands-on tutorial session of Naiad distributed computation framework on Amazon’s cloud computing platform including implementation of MapReduce parallel computing. (Tutor: Karthik Nilakant).
https://www.cambridgeplus.net/tutorials/2014/Naiad/NaiadTutorial_2014.html
If you want to work using your laptop (Linux), bring it with you!

2014/11/13 Session 5: Optimised Approaches in Graph Data Processing

This session will be on Thursday November 13 at 10:00-12:00 in GS15.
Guest lecture: Amitabha Roy (EPFL): "Tackling Large Graphs with Secondary Storage" (slides).
Optimised approaches in large scale graph processing (single computer, scalable algoritisms, streaming, algorithmic, GraphDB)

1. W. Han, S. Lee, K. Park, J. Lee, M. iKim, J. Kim, H. Yu: TurboGraph: A Fast Parallel Graph Engine Handling
Billion-scale Graphs in a Single PC, KDD, 2013.

Patrick Short (slides)
2. A. Kyrola and G. Blelloch: Graphchi: Large-scale graph computation on just a PC, OSDI, 2012.

3. A. Roy, I. Mihailovic, W. Zwaenepoel: X-Stream: Edge-Centric Graph Processing using Streaming Partitions, SOSP, 2013.

4. X. Hu, Y. Tao, C. Chung: Massive Graph Triangulation, SIGMOD, 2013.

Matt Huxtable (slides)
5. W. Xie, G. Wang, D.Bindel, A. Demers, J. Gehrke: Fast Iterative Graph Computation with Block Updates, VLDB, 2014.

6. S. Arifuzzaman, M. Khan, M. Marathe: PATRIC: A Parallel Algorithm for Counting Triangles in Massive Networks, CIKM, 2014.

Ana Trisovic (slides)
7. J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin: Powergraph: distributed graph-parallel computation on natural
graphs. OSDI, 2012.

8. A. Buluc, K. Madduri: Parallel Breadth-First Search on Distributed Memory Systems, SC, 2005.

9. C. Luk, S. Hong, H. Kim: Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping , PLDI, 2009.

2014/11/17 Session 6: Stream Data Processing and Data/Query Model

Guest lecture: Peter Pietzuch (Imperial College London) "Drinking from the Fire Hose: The Rise of Scalable Stream Processing Systems" (slides)
Data and continuous query in steam data processing

1. V. Gulisano, R. Jimenez-Peris, M. Patiño-Martinez, P. Valduriez: StreamCloud: A Large Scale Data Streaming System, ICDCS, 2010.

2. V. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, I. Stoica: Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP, 2013.

3. R. Fernandez, M. Migliavacca, E. Kalyvianaki, P. Pietzuch: Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management, SIGMOD, 2013.

4. D. Abadi, Y. Ahmad, M. Balazinska et al. : The Design of the Borealis Stream Processing Engine, CIDR, 2005.

5. S. Babu, J. Widom: Continuous Queries over Data Streams, SIGMOD Record 30(3), 2001.

William Jones (slides)
6. B.Gedik, H. Andrade, K. Wu, P. Yu, and M. Doo: SPADE: the system S Declarative Stream Processing Engine , SIGMOD. 2008.

Mariana-Cristina Marasoiu (slides)
7. E. Zeitler and T.Risch: Massive scale-out of expensive continuous queries, VLDB, 2011.

Maciej Biskupiak (slides)
8. R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, E. Chen: Kineograph: Taking the Pulse of a Fast-Changing and Connected World, EuroSys, 2012.

2014/11/24 Session 7: Scheduling Irregular Tasks in Parallel Computing Environments

Scheduling irregular tasks and programming in heterogeneous parallel processing environmenets

Neil Satra (slides)
1. S. Hong, H. Chafi, E. Sedlar, K.Olukotun: Green-Marl: A DSL for Easy and Efficient Graph Analysis, ASPLOS, 2012.

Micahel Schaarschmidt (slides)
2. D. Prountzos, R. Manevich, K. Pingali: Elixir: A System for Synthesizing Concurrent Graph Programs, OOPSLA, 2012.

3. D. Nguyen, A. Lenharth, K. Pingali: A Lightweight Infrastructure for Graph Analytics, SOSP 2013.

4. J. Zhong, B. He: Medusa: Simplified Graph Processing on GPUs, IEEE TPDS, 2013.

Matt Huxtable (slides)
5. C. Rossbach, Y. Yu, J. Currey, J-P. Martin, D. D. Fetterly: Dandelion: a Compiler and Runtime for Heterogeneous Systems, SOSP 2013.

Varun Gandhi (slides)
6. A. Gharaibeh, E. Santos-Neto, L. Costa, M. Ripeanu: Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems, IEEE TPC, 2014.

7. M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, L. P. Chew: Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs, SPAA, 2008.

Philip Leonard (slides)
8. S. Salihoglu, J. Widom: Optimizing Graph Algorithms on Pregellike Systems, VLDB, 2014.

David Reti (slides)
9. D. Merrill, M. Garland, A. Grimshaw: Scalable GPU Graph Traversal, PPoPP, 2012.

2014/12/01 Session 8: Presentation of Open Source Project Study

Start @15:00 in FW11.
Presentation of Open Source Project Study by all (10 minutes of presentation including Q&A for each presentation)

15:00 David Reti (GraphChi) GraphChi performance op-down and bottom-up BFS (slides)
15:10 Maciej Biskupiak (GaphLab) Distributed Graph Colouring on the GraphLab abstraction (slides)
15:20 Mariana-Cristina Marasoiu (GraphLab) Multilayer networks in GraphLab (slides)
15:30 Matt Huxtable (Naiad) A timely dataflow-based k-means clustering algorithm (slides)
15:40 Micahel Schaarschmidt (Storm) Using Apache Storm to track location-based sentiments (slides)
15:50 Neil Satra (Spark) Distributed Regression using Spark (slides)
16:00 Philip Leonard (GraphLab) Exploring Graph Colouring Heuristics in GraphLab (slides)
16:10 Varun Gandhi (Spark) Implement a distributed Alternating Least Squares Algorithm for matrix completion (slides)
16:20 William Jones (Giraph) Extending Musketeer to Apache Giraph (slides)

16:45-17:00 Wrap-up Discussion (slides)

Coursework 1 (Reading Club)

The reading club will require you to read between 1 and 3 papers every week. You need to fill out simple review_log (MS word format, text format) prior to each session and email me by the end of Sunday. The minimum requirement of review_log is one per session, but you can read as many as you want and fill the review_log for each paper you read. review_log is not marked but 'tick'.

At each session, around 3 papers are selected under the session topic, and if you are assigned to present your review work, please prepare 15-20 minutes slides for presenting your review work. Your presented material should also be emailed by the following day Wednesday. You would present your review work approximately twice during the course. The paper includes following two types and you can focus on the specified aspects upon reviewing the paper.

Full length papers
- What is the significant contribution?
- What is the difference from the existing works?
Short length papers
- What is the novel idea?
- What is required to complete the work?

Coursework 2 (Reports)

The following three reports are required, which could be extended from the reading assignment of the reading club or a different one within the scope of data centric networking.

Review report on a full length of paper (~1800 words)
- Describe the contribution of paper in depth with criticism
- Crystallise the significant novelty in contrast to the other related work
- Suggestion for future work
Survey report on sub-topic in data centric networking (~1800 - max 2000 words)
- Pick up to 5 papers as core papers in your survey scope
- Read the above and expand your reading through related work
- Comprehend your view and finish as your survey paper
- See how to write a survey paper
Project study and exploration of a prototype (~2500 words)
- See Open Source Projects
- What is the significance of the project in the research domain?
- Compare with the similar and succeeding projects
- Demonstrate the project by exploring its prototype
- Please email your project selection (MS word format or text format <150 words) by November 7 (extended), 2014
- Project presentation on December 1, 2014.

The reports 1 and 2 should be handed in by the end of 5^th week (November 14, 2014 - 12:00 noon ) and 7^th week (November 28, 2014 - 12:00 noon) of the course (not in any particular order). The report 3 should be by the end of the Michaelmas term (January 13, 2015 - 16:00 - but if you could finish by 19th of December, that will be good!).

Assessment

The final grade for the course will be provided as a letter grade or percentage and the assessment will consist of two parts:

20%: for a reading club (Presentation and participation + tick of review_log)
80%: for the three reports
- 20%: Intensive review report
- 25%: Survey report
- 35%: Project study

Open Source Projects

See the candidates of Open Source Projects in data centric networking. The list is not exhausted. If you take anything other than the one in the list, please discuss with me. The purpose of this assignment is to understand the prototype of the proposed architecture, algorithms, and systems through running an actual prototype and present/explain to the other people how the prototype runs, any additional work you have done including your own applications and setup process of the prototype. This experience will give you better understanding of the project. These Open Source Projects come with a set of published papers and you should be able to examine your interests in the paper through running the prototype. Some projects are rather large and may require extensive environment and time; make sure you are able to complete this assignment.

How to Read/Review a Paper

The following papers aid how to read/review a paper.

S. Keshav: How to Read a Paper, ACM SIGCOMM Computer Communication Review 83 Volume 37, Number 3, July 2007.
T. Roscoe: Writing Reviews for Systems Conferences, 2007.
Simon Peyton-Jones: How to write a great paper and give a great talk about it, Microsoft Research Cambridge.
David A. Patterson: How to Have a Bad Career in Research/Academia, 2001.

Further supplement: see ‘how to read/review a paper’ section in Advanced Topics in Computer Systems by Steven Hand.

Presentations

Presentations should be about 20-25 minutes long, where you need to cover the following aspects.

What are the background and the problem domain of the paper? What is the motivation of the presented work? What is the difference from the existing works? What is the novel idea? How did the paper change/unchange the research in the research community?
What is the significant contribution? How did the authors tackle the problem? Did the authors obtain expected result from their trial?
How do you like the paper and why? What is the takeaway message to you (and to research community)? What is required to complete the work?

The following document aids in presenting a review.

Perdita Stevens in University of Edinburgh How to [read, present, review] a research paper.

How to write a survey paper

A survey paper provides the readers with an exposition of existing work that is comprehensive and organized. It must expose relevant details associated in the surveying area, but it is important to keep a consistent level of details and to avoid simply listing the different works. Thus a good survey paper should demonstrate a summary of recent research results in a novel way that integrates and adds understanding to work in the field. For example, you can take an approach by classifying the existing literature in your own way; develop a perspective on the area, and evaluate trends. Thus, after defining the scope of your survey, 1) classify and organize the trend, 2) critical evaluation of approaches (pros/cons), and 3) add your analysis or explanation (e.g. table, figure). Also adding reference and pointer to further in-depth information is important (summary from Rich Wolski’s note).

Papers for OS Principles (Distributed Storage and Deterministic Parallelism)

Following papers will help you to understand distributed storage and parallelism.

Systems Research and System Design

1. B. Lampson: Hints for Computer Systems Design (Revised), ACM OSR 1983.

Distributed Storage

2. S. Ghemawat, H. Gobioff, and S. Leung: The Google File System, ACM SOSP 2003.
3. F. Chang et al: BigTable: A Distributed Storage System for Structured Data, USENIX OSDI 2006.
4. G. DeCandia et al: Dynamo: Amazon's Highly Available Key-value Store, ACM SOSP 2007.

Deterministic Parallelism

5. J. Devietti et al: DMP: Deterministic Shared Memory Multiprocessing, ACM ASPLOS 2009.
6. A. Aviram, et al: Efficient System-Enforced Determistic Parallelism, USENIX OSDI 2010.
7. T. Liu et al: Dthreads: Efficient and Determistic Multithreading, ACM SOSP 2011.

Contact Email

Please email to eiko.yoneki@cl.cam.ac.uk for your submission of course work or any question.

Computer Laboratory