LargeScale Data Processing and Optimisation (20202021 Michaelmas Term)

OverviewThis module provides an introduction to largescale data processing, optimisation, and the impact on computer system's architecture. Largescale distributed applications with high volume data processing such as training of machine learning will grow ever more in importance. Supporting the design and implementation of robust, secure, and heterogeneous largescale distributed systems is essential. To deal with distributed systems with a large and complex parameter space, tuning and optimising computer systems is becoming an important and complex task, which also deals with the characteristics of input data and algorithms used in the applications. Algorithm designers are often unaware of the constraints imposed by systems and the best way to consider these when designing algorithms with massive volume of data. On the other hand, computer systems often miss advances in algorithm design that can be used to cut down processing time and scale up systems in terms of the size of the problem they can address. Integrating machine learning approaches (e.g. Bayesian Optimisation, Reinforcement Learning) for system optimisation will also be explored in this course. On completion of this module, the students should:
Module Structure
Schedule and Reading ListAll the sessions will be online because of COVID19 and we'll meet every Thursday (from October 8 to November 26) in 2020. The time slot is 10:0012:00.1 2020/10/08 Session 1: Introduction to LargeScale Data Processing and Optimisation
2020/10/15 Session 2: Data flow programming
1. Yuan Yu, Michael Isard, D. Fetterly, M. Budiu, U.
Erlingsson, P.K. Gunda, J. Currey:
2. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma,
M. McCauley, M. Franklin, S. Shenker, I. Stoica:
Franciszek Budrowski
(slides)
*4. J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004.
*5. Derek Murray, Malte Schwarzkopf, Christopher Smowton,
Steven Smith, Anil Madhavapeddy and Steven Hand:
*6. Naiad
Frank McSherry's Talk on Differential Dataflow is
here.
Alexander Frost
(slides)
*6.3. F. McSherry, A. Lattuada, M. Schwarzkopf, T. Roscoe: Shared Arrangements: practical interquery sharingfor streaming dataflows, VLDB, 2020. 7. P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquini: Incoop: MapReduce for incremental computation, ACM SOCC, 2011.
Luou Wen
(slides)
*8.1 M. Abadi et al.: TensorFlow: LargeScale Machine Learning on Heterogeneous Distributed Systems, Preliminary White Paper, 2015. 9. M. Looks et al.: Deep Learning with Dynamic Computation Graphs, ICLR, 2017. 10. M. Abadi, M. Isard and D.
Murray: A Computational
Model for TensorFlow  An Introduction, MAPL, 2017.
Ross Tooley
(slides)
14. T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. FernandezMoctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, S. Whittle: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in MassiveScale, Unbounded, OutofOrder Data Processing, VLDB, 2015. 15. S. Li, Y. Zhao, R. Varma, et. al: PyTorch Distributed: Experiences on Accelerating Data Parallel Training, VLDB, 2020. 2020/10/22 Session 3: Largescale graph data processing
*1. G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski: Pregel: A System for LargeScale Graph Processing, SIGMOD, 2010. 2. Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, Z.Zhang: MadLINQ: largescale distributed matrix computation for the cloud, EuroSys, 2012. 3. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. Hellerstein: Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB, 2012.
Sean Parker
(slides)
Armins Stepanjans
(slides)(Armins' blog on Ligra)
6. J. Gonzalez, R. Xin, A. Dave, D. Crankshaw, M. Franklin, I. Stoica: GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI, 2014. 7. B. Shao, H. Wang, Y. Li: Trinity: A Distributed Graph Engine on a Memory Cloud, SIGMOD, 2013. 8. A. Kyrola and G. Blelloch: Graphchi: Largescale graph computation on just a PC, OSDI, 2012.
Samuil Stoychev
(slides)
10. A. Roy, L. Bindschaedler, J. Malicevic and W. Zwaenepoel: Chaos: Scaleout Graph Processing from Secondary Storage , SOSP, 2015. 11. F. McSherry, M. Isard and D. Murray: Scalability! But at what COST? , HOTOS, 2015. 12. X. Hu, Y. Tao, C. Chung: Massive Graph Triangulation, SIGMOD, 2013. 13. W. Xie, G. Wang, D.Bindel, A. Demers, J. Gehrke: Fast Iterative Graph Computation with Block Updates, VLDB, 2014. 14. S. Hong, H. Chafi, E. Sedlar, K.Olukotun: GreenMarl: A DSL for Easy and Efficient Graph Analysis, ASPLOS, 2012. 15. D. Prountzos, R. Manevich, K. Pingali: Elixir: A System for Synthesizing Concurrent Graph Programs, OOPSLA, 2012. 16. D. Nguyen, A. Lenharth, K. Pingali: A Lightweight Infrastructure for Graph Analytics, SOSP 2013. 17. D. Merrill, M. Garland, A. Grimshaw: Scalable GPU Graph Traversal, PPoPP, 2012. 18. A. Gharaibeh, E. SantosNeto, L. Costa, M. Ripeanu: Efficient LargeScale Graph Processing on Hybrid CPU and GPU Systems, IEEE TPC, 2014.
Zhuang Zhang
(slides)
20. H. Dai, Z. Kozareva, B. Dai,
A. Smola and L. Song:
Learning SteadyStates of Iterative Algorithms over Graphs, ICML, 2018.
2020/10/29 Session 4: Map/Reduce and Deep Neural Network using TensorFlow Handson Tutorial
2020/11/05 Session 5: Many Aspects of Optimisation in Computer Systems
1. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, A. Ng.: Large scale distributed deep networks. NIPS, 2012. 2. G. Venkates et al.: Accelerating Deep Convolutional Networks using lowprecision and sparsity, ICASSP, 2017. 3. V. Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning, ICML, 2016. 4. J. Ansel et al. Opentuner: an extensible framework for program autotuning. PACT, 2014. 5. B. Bodin, L. Nardi, MZ Zia et al.: Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding, PACT, 2016. *6. J. Ansel et al.: Petabricks: A language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI, 2009. 7. V. Mnih et al.: Playing Atari with Deep Reinforcement Learning, NIPS, 2013. 8. J. Snoek, H. Larochelle, and R. Adams: Practical Bayesian Optimization of Machine Learning Algorithms, NIPS, 2012. 9. B. Teabe et al.: Applicationspecific quantum for multicore platform scheduler, EuroSys, 2016. 10. G. Tesauro et al.: A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation, ICAC, 2006.
Luou Wen
(slides)
14. F. Hutter et al.: Algorithm runtime prediction: Methods&evaluation, Elsevier J. AI, 2014. 15. Z. Jia, M. Zaharia, and A. Aiken: Beyond Data and Model Parallelism for Deep Neural Networks, SYSML, 2019. 16. Ł. Kaiser et al.: Model Based Reinforcement Learning for Atari, arXiv, 2019. 17. H. Liu, K. Simonyan, and Y. Yang: DARTS: Differentiable Architecture Search, arXiv, 2018.
Alexander Frost
(slides)
19. S. Palkar, J. Thomas, A.
Shanbhagy, D. Narayanan, H. Pirky, M. Schwarzkopfy, S. Amarasinghey, and M.
Zaharia:
20. S. Palkar, J. Thomas, D.
Narayanan, P. Thaker, R. Palamuttam, P. Negi, A. Shanbhag, M. Schwarzkopf, H.
Pirk, S. Amarasinghe, S. Madden, M. Zaharia:
Evaluating EndtoEnd
Optimization for Data Analytics Applications in Weld, VLDB, 2018.
23. R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. Gonzalez, I. Stoica: Tune: A Research Platform for Distributed Model Selection and Training, ICML, 2018.
*27. T. Chen, T. Moreau, Z. Jiang,
L. Zheng, S. Jiao, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C.
Guestrin, and A. Krishnamurthy:
TVM: An Automated EndtoEnd
Optimizing Compiler for Deep Learning, OSDI, 2018. *28. T. Chen, T. Moreau, Z. Jiang,
L. Zheng, S. Jiao, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C.
Guestrin, and A. Krishnamurthy:
TVM: EndtoEnd Compilation
Stack for Deep Learning, SysML, 2017.
29. H. Zhang et al.: Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters, ATC, 2017.
Armins Stepanjans
(slides)
Ross Tooley
(slides)
32. N. K. Ahmed, et al.: On Sampling from Massive Graph Streams, VLDB, 2017. *33. Kraska, T., Alizadeh, M., Beutel, A., Chi, E.H., Ding, J., Kristo, A., Leclerc, G., Madden, S., Mao, H. and Nathan, V.: SageDB: A learned database system, CIDR, 2019. 34. Ma, L., Ding, B., Das, S. and Swaminathan, A.: Active Learning for ML Enhanced Database Systems, SIGMOD, 2020. 35. A. Kipf, R. Marcus, A. van Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann: SOSD: A Benchmark for Learned Indexes, NeurIPS Workshop on ML for Systems, 2019. 36. D. Ha and J. Schmidhuber: World Models, arXiv, 2018 (https://worldmodels.github.io). *37. L. Li te al.: A System for Massively Parallel Hyperparameter Tuning, MLSys, 2020. 2020/11/12 Session 6: Probabilistic Programming
Guest lecture: Brooks Paige (UCL, ATI) Title: Programs as probabilistic models Abstract: Probabilistic models used in quantitative sciences have historically coevolved with methods for performing inference: specific modeling assumptions are made not because they are appropriate to the application domain, but because they are required to leverage existing software packages or inference methods. The emerging field of probabilistic programming aims to reduce the technical and cognitive overhead for writing and designing novel probabilistic models, by introducing a specialized programming language as an abstraction barrier between modeling and inference. While we would ideally be able to provide “automatic” inference for any probabilistic model, this proves severely challenging for models written in sufficiently expressive languages. In this talk I will discuss some of these difficulties, and provide an introduction and overview of different approaches to probabilistic programming. Bio: Brooks Paige is an associate professor in machine learning at the University College London AI Centre. He is also a Turing fellow at the Alan Turing Institute, and a statistical ambassador for the Royal Statistical Society. He holds a D.Phil in Engineering Science from the University of Oxford, where he was supervised by Frank Wood; an M.A. in Statistics from Columbia University; and a B.A. in Mathematics from Amherst College. Reading Club online @10:00 1. E. Bingham et al.: Pyro: Deep Universal Probabilistic Programming, Journal of Machine Learning Research, 2019. 2. D. Tran et al.: Edward: A library for probabilistic modeling, inference, and criticism, arXiv, 2017. 3. N. Goodman, V. Mansinghka, D. Roy, K. onawitz, J. Tenenbaum: Church: a language for generative models. In Proceedings of the Conference on Uncertainty in Arti cial Intelligence, UAI, 2008. 4. F. Wood, J. van de Meent, V. Mansinghka: A new approach to probabilistic programming inference, AISTATS, 2014. 5. B. Paige and F. Wood: A compilation target for probabilistic programming languages, ICML, 2014. *6. J. Ai et al.: HackPPL: a universal probabilistic programming language, MAPL, 2019.
Matthew Guest
(slides)
*8. Ge, H., Xu, K. and Ghahramani, Z.: Turing: A language for flexible probabilistic inference, AISTATS, 2018.
Sean Parker
(slides)
10. T. Rainforth et al.: Bayesian Optimization for Probabilistic Programs, NIPS, 2016. *11. M. Balandat et al.: BOTORCH: Bayesian Optimization in PyTorch, Arxiv 2020. 2020/11/19 Session 7: Optimisation of Computer Systems using ML
Franciszek Budrowski
(slides)
*1.2. A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. Le and J. Dean: A Hierarchical Mode for Device Placement, ICLR, 2018. *2. A. Mirhoseini and A. Goldie: Chip Placement with Deep Reinforcement Learning, ISPD, 2020. *3. R. Addanki, S. B. Venkatakrishnan, S. Gupta, H. Mao, M. Alizadeh : Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning , arXiv, 2019. 4. F. Yang et al.: LFTF: A Framework for Efficient Tensor Analytics at Scale, VLDB, 2017.
7. I. Gog, M. Schwarzkopf, A. Gleave, R.
Watson, S. Hand: Firmament: fast, centralized cluster scheduling at scale, OSDI,
2016. 9. C. Delimitrou et al.: Quasar: ResourceEfficient and QoSAware Cluster Management, ASPLOS, 2014. 10. H. Mao et al.: Neural Adaptive Video Streaming with Pensieve, SIGCOMM, 2017. 11. S. Venkataraman et al.: Ernest: Efficient Performance Prediction for LargeScale Advanced Analytics, NSDI, 2016. 12. K. LaCurts et al.: Cicada: Introducing Predictive Guarantees for Cloud Networks, HOTCLOUD, 2014. 13. H. Hoffmann et al.: Dynamic Knobs for Responsive PowerAware Computing, Asplos, 2011. 14. N.J. Yadwadkar, B. Hariharan, J. Gonzalez and R. Katz: Faster Jobs in Distributed Data Processing using MultiTask Learning, SDM, 2015. 15. X. Dutreih et al.: Using Reinforcement Learning for Autonomic Resource Allocation in Clouds: Towards a Fully Automated Workflow, ICAS, 2011. 16. J. Eastep et al.: Smart Data Structures: An Online Machine Learning Approach to Multicore Data Structures, ICAC, 2011. 17. H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, M. Alizadeh: Learning Scheduling Algorithms for Data Processing Clusters, SIGCOMM, 2019. 18. E. Ipek et al.: SelfOptimizing Memory Controllers: A Reinforcement Learning Approach, ISCA, 2008. 19. S. Teerapittayanon et al.: Distributed Deep Neural Networks over the Cloud, the Edge and End Devices, ICDCS, 2017. 20. B. Zoph et al.: Learning Transferable Architectures for Scalable Image Recognition, arXiv, 2017. 21. D. Golovin et al.: Google Vizier: A Service for BlackBox Optimization, KDD, 2017. 22. D. Baylor et al.: TFX: A TensorFlowBased ProductionScale Machine Learning Platform, KDD, 2017. 23. H. Mao et al.: Resource Management with Deep Reinforcement Learning, HotNets, 2016. 24. M. Raghu et al.: On the Expressive Power of Deep Neural Networks, PMLR, 2017. 25. D. Aken et al.: Automatic Database Management System Tuning Through Largescale Machine Learning, SIGMOD, 2017.
26. A. Pavlo et al.:
SelfDriving Database
Management Systems, CIDR, 2017. 28. L. Espeholt et al.:
IMPALA: Scalable
Distributed DeepRL with Importance Weighted ActorLearner Architectures,
ICML, 2018. 29. M. Carvalho et al.: Longterm SLOs for reclaimed cloud computing resources, SOCC, 2014.
*30. A. Ratner, S. Bach, H.
Ehrenberg, J. Fries, S. Wu, and C. Ré:
Snorkel: Rapid Training
Data Creation with Weak Supervision, VLDB, 2017. *31. A. Ratner, B. Hancock, J.
Dunnmon, R. Goldman, and C. Ré:
Snorkel MeTaL: Weak
Supervision for MultiTask Learning, DEEM, 2018. 32. A. Koliousis, P. Watcharapichat, M. Weidlich, L. Mai, P. Costa, P. Pietzuch: CROSSBOW: Scaling Deep Learning with Small Batch Sizes on MultiGPU Servers, VLDB, 2019. 33. A. Floratou et al.: Dhalion: selfregulating stream processing in Heron, VLDB, 2017. 34. T. Li, Z. Xu, J. Tang and Y. Wang: ModelFree Control for Distributed Stream Data Processing using Deep Reinforcement Learning, VLDB, 2018. 35. E. Lambart et al.: Low Level Control of a Quadrotor with Deep ModelBased Reinforcement Learning, IEEE Robotics and Automation Letters, 2019. 36. Y. Kang et al.: Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge, ASPLOS, 2017. 37. Y. You et al.: Scaling Deep Learning on GPU and Knights Landing clusters, SC, 2017.
Ross Tooley
(slides)
Zhuang Zhang
(slides)
40. K. Tzoumas, A. Deshpande, and C. S. Jensen: Efficiently adapting graphical models for selectivity estimation, VLDB, 2013. 2020/11/26 Session 8: Presentation of Open Source Project Study
Wrapup Discussion Coursework 1 (Reading Club)The reading club will require you to read between 1
and 3 papers every week. You need to fill out simple review_log
(MS word format,
text
format) prior to each session and email me by Wednesday 12:00 (noon). The minimum requirement of review_log is one per session, but
you can read as many as you want and fill the review_log for each paper you
read. review_log is not marked but 'tick'.
Coursework 2 (Reports)The following three reports are required, which could be extended from the reading assignment of the reading club or a different one within the scope of data centric networking.
The report 1 should be handed in by November 13, 2020  16:00 and the report 2 by December 4, 2020  16:00 . The report 3 should be by the end of the Michaelmas term (January 20, 2021  16:00  but if you could finish by 21st of December, 2020 that will be good!). AssessmentThe final grade for the course will be provided as a letter grade or percentage and the assessment will consist of two parts:
Open Source ProjectsSee the candidates of Open Source Projects in data centric networking. The list is not exhausted. If you take anything other than the one in the list, please discuss with me. The purpose of this assignment is to understand the prototype of the proposed architecture, algorithms, and systems through running an actual prototype and present/explain to the other people how the prototype runs, any additional work you have done including your own applications and setup process of the prototype. This experience will give you better understanding of the project. These Open Source Projects come with a set of published papers and you should be able to examine your interests in the paper through running the prototype. Some projects are rather large and may require extensive environment and time; make sure you are able to complete this assignment. How to Read/Review a PaperThe following papers aid how to read/review a paper.
Further supplement: see ‘how to read/review a paper’ section in Advanced Topics in Computer Systems by Steven Hand. PresentationsPresentations should be about 1520 minutes long, where you need to cover the following aspects.
The following document aids in presenting a review.
How to write a survey paperA survey paper provides the readers with an exposition of existing work that is comprehensive and organized. It must expose relevant details associated in the surveying area, but it is important to keep a consistent level of details and to avoid simply listing the different works. Thus a good survey paper should demonstrate a summary of recent research results in a novel way that integrates and adds understanding to work in the field. For example, you can take an approach by classifying the existing literature in your own way; develop a perspective on the area, and evaluate trends. Thus, after defining the scope of your survey, 1) classify and organize the trend, 2) critical evaluation of approaches (pros/cons), and 3) add your analysis or explanation (e.g. table, figure). Also adding reference and pointer to further indepth information is important (summary from Rich Wolski’s note). Papers for OS Principles (Distributed Storage and Deterministic Parallelism)
3. F. Chang et al: BigTable: A Distributed Storage System for Structured Data, USENIX OSDI 2006. 4. G. DeCandia et al: Dynamo: Amazon's Highly Available Keyvalue Store, ACM SOSP 2007.
6. A. Aviram, et al: Efficient SystemEnforced Determistic Parallelism, USENIX OSDI 2010. 7. T. Liu et al: Dthreads: Efficient and Determistic Multithreading, ACM SOSP 2011. Contact EmailPlease email to eiko.yoneki@cl.cam.ac.uk for the question. 