Course pages 2017–18

Large-scale data processing and optimisation

Principal lecturer: Dr Eiko Yoneki
Taken by: MPhil ACS, Part III
Code: R244
Hours: 16
Class limit: 10 students

Aims

This module provides an introduction to large-scale data processing, optimisation, and their impact on computer systems architecture. Large-scale distributed applications with high-volume data processing, such as the training of machine learning models, will continue to grow in importance. Supporting the design and implementation of robust, secure, and heterogeneous large-scale distributed systems is essential. For distributed systems with a large and complex parameter space, tuning and optimising computer systems is becoming an important and complex task, which must also take into account the characteristics of the input data and the algorithms used in the applications. Algorithm designers are often unaware of the constraints imposed by systems and of the best way to consider these when designing algorithms for massive volumes of data. Conversely, computer systems often miss advances in algorithm design that could be used to cut down processing time and scale up systems in terms of the size of the problem they can address. Integrating machine learning approaches for system optimisation will also be explored in this course.

Syllabus

This course provides perspectives on large-scale data processing, including data-flow programming, stream processing, graph data processing and computer system optimisation, especially using machine learning approaches, thus providing a solid basis to work on the next generation of distributed systems.

The module consists of 8 sessions, with 5 sessions on specific aspects of large-scale data processing research. Each session discusses 3-4 papers, led by the assigned students. One session is a hands-on tutorial on MapReduce using data flow programming and/or Deep Neural Networks using Google TensorFlow with Amazon EC2. The first session gives advice on how to read and review a paper, together with a brief introduction to different perspectives in large-scale data processing and optimisation. The last session is dedicated to student presentations of open-source project studies.

Introduction to large-scale data processing and optimisation
Data flow programming: Map/Reduce to TensorFlow
Large-scale graph data processing: storage, processing model and parallel processing
Map/Reduce and Deep Neural Network using TensorFlow hands-on tutorial with Amazon EC2
Stream data processing and data/query model
Machine Learning for optimisation of computer systems
Task scheduling optimisation and Auto-tuning
Presentation of Open Source Project Study

Objectives

On completion of this module, students should:

Understand key concepts of scalable data processing approaches in future computer systems.
Obtain a clear understanding of building distributed systems using data-centric programming and large-scale data processing.
Understand the large and complex parameter space in computer systems optimisation and the applicability of machine learning approaches.

Coursework

Reading Club:

The reading club will involve 1-3 papers every week. At each session, around 2-3 papers are selected under the given topic, and the students present their review work.
Hands-on tutorial session on data flow programming, including writing an application to process streaming Twitter data and/or Deep Neural Networks using Google TensorFlow on Amazon EC2.

Reports

The following three reports are required. They may be extended from the reading club assignments, within the scope of data-centric systems and networking.

1. Review report on a full-length paper (max 1800 words)
- Describe the contribution of the paper in depth, with criticisms
- Crystallise the significant novelty in contrast to other related work
- Suggest future work
2. Survey report on a sub-topic in large-scale data processing and optimisation (max 2000 words)
- Pick up to 5 papers as core papers in the survey scope
- Read the above and expand reading through related work
- Form an overall view and write your own survey paper
3. Project study and exploration of a prototype (max 2500 words)
- What is the significance of the project in the research domain?
- Compare with similar and succeeding projects
- Demonstrate the project by exploring its prototype

Reports 1 and 2 should be handed in by the end of the 5th week and the 7th week of the course (in either order). Report 3 should be handed in by the end of the Michaelmas Term.

Assessment

The final grade for the course will be provided as a percentage, and the assessment will consist of two parts:

20%: for reading club (participation, presentation)
80%: for the three reports:
- 20%: Intensive review report
- 25%: Survey report
- 35%: Project study

Recommended reading

[1] Abadi, M. et al. TensorFlow: A System for Large-Scale Machine Learning, OSDI, 2016.
[2] Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N. & Czajkowski, G.: Pregel: A System for Large-Scale Graph Processing, SIGMOD, 2010.
[3] Ansel, J. et al. OpenTuner: An Extensible Framework for Program Autotuning, PACT, 2014.
[4] Hong, S., Chafi, H., Sedlar, E., Olukotun, K.: Green-Marl: A DSL for Easy and Efficient Graph Analysis, ASPLOS, 2012.
[5] E. Zeitler and T. Risch: Massive scale-out of expensive continuous queries, VLDB, 2011.
[6] V. Dalibard, M. Schaarschmidt, E. Yoneki: BOAT: Building Auto-Tuners with Structured Bayesian Optimization, WWW, 2017.
[7] D. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, M. Abadi: Naiad: A Timely Dataflow System, SOSP, 2013.
[8] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin: PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, OSDI, 2012.
[9] B. Gedik, H. Andrade, K. Wu, P. Yu, M. Doo: SPADE: The System S Declarative Stream Processing Engine, SIGMOD, 2008.
[10] M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, L. P. Chew: Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs, SPAA, 2008.
[11] I. Gog, M. Schwarzkopf, A. Gleave, R. Watson, S. Hand: Firmament: fast, centralized cluster scheduling at scale, OSDI, 2016.

A complete list can be found on the course material web page. See also the 2015-2016 course material for the previous course, Data Centric Systems and Networking.

Department of Computer Science and Technology