Scalable Data Processing for Big Data from Laptop, Multi-core, to Cluster Computing

Cambridge Big Data Collaborative Workshop

Date: July 14, 2016 (Thursday)  9:00 - 17:00 @ University of Cambridge Computer Laboratory (FW26)  Register HERE!

 Organisers

 Workshop Summary

Large-scale data analytics is emerging as a huge consumer of computational resources due to its complex, data-hungry algorithms. Graph and networked data analysis in particular is becoming increasingly important for solving problems in diverse fields. As these problems grow in scale, parallel computing resources are required to meet their computational and memory demands. Notably, the algorithms, software and hardware that have worked well for mainstream parallel applications are usually not effective for massive-scale data from the real world, which exhibits more complex structure. Research into large-scale data processing is currently fragmented. This workshop brings together researchers from systems, computer architecture, algorithms and databases to discuss emerging trends and to identify opportunities for future advances in data processing. The workshop takes the format of presentations by key researchers followed by discussion of specific topics. It has the following goals:

1. Identify clear application areas and algorithms that are of interest and representative of large-scale data processing as a whole.

2. Close the gap between domain algorithms and systems research. Algorithm designers are often unaware of the constraints imposed by systems and of the best way to account for them when designing graph algorithms for big (graph) data; conversely, the systems community often misses advances in algorithm design that could cut processing time and scale systems to larger problems.

3. Build consensus on a programming paradigm. This effort is currently fragmented between researchers building domain-specific languages or databases for executing and storing such computations, and researchers trying to fit existing systems, and the means to program them, to these applications. Closing this gap is critical for making these tools available to the wider network science community, and for opening up new research areas such as algorithm-independent optimisation.

 Outcome

The workshop will serve the purpose of unifying the currently fragmented research community dealing with large-scale data processing, and may open doors to high-quality solutions for the problem of exploiting the deluge of rich data facing us today. We want to ensure that the workshop prompts the necessary introspection within the community and produces useful output.

 Program

Each session has a focused topic, with discussion following the talk or talks. Each talk will be roughly 25 minutes (student talks are shorter). Active participation in the discussion is very welcome!

09:00 - 09:30 Coffee & Biscuits

09:30 - 09:45 Opening (slides)

09:45 - 10:30 Applications (High Performance Data Analytics)   

09:45 - 10:15 Juha Jaykka (Univ. Cambridge COSMOS Intel Parallel Computing Centre): Advantages and disadvantages of modern parallel computing (slides)

10:15 - 10:30 Discussion                             

10:30 - 11:15 System Architecture I (HPC)

10:30 - 11:00 Christophe Dubach (Univ. Edinburgh): Lift: a Data-Parallel Language for High-Performance Parallel Pattern Code Generation (slides)

11:00 - 11:15 Discussion    

11:15 - 11:30 Coffee Break

11:30 - 12:15 System Architecture II (Scheduling)

11:30 - 12:00 Daniel Goodman (Oracle Labs): Fine-grained parallel work scheduling in scale-up graph analytics (slides)

12:00 - 12:15 Discussion    

12:15 - 13:00 Lunch

13:00 - 14:30 System Architecture III (Distributed Computing/Memory Management)

13:00 - 13:30 Aleksandar Dragojevic (Microsoft Research Cambridge): FaRM: a platform for low-latency computing (slides)

13:30 - 13:45 Sam Ainsworth (Univ. Cambridge): Graph Prefetching Using Data Structure Knowledge (slides)

13:45 - 14:15 Rajeev Raman (Univ. Leicester): In-memory processing of big data via succinct data structures (slides)

14:15 - 14:30 Discussion    

14:30 - 14:45 Coffee Break

14:45 - 16:00 Heterogeneous Cores (Optimisation in Stream Processing and Neural Networks)

14:45 - 15:10 Alexandros Koliousis (Imperial College London): SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures (slides)

15:10 - 15:30 Valentin Dalibard (Univ. Cambridge): Modern Systems for Neural Networks  (slides)

15:30 - 16:00 Discussion

16:00 - 16:30 Discussion of the future vision of the data processing stack, from hardware and low-level programming through parallel programming platforms to applications

16:30 - 17:30 Closing + wine 

 

 Talk Abstracts

Juha Jaykka (University of Cambridge COSMOS Intel Parallel Computing Centre)

High Performance Data Analytics on Distributed Parallel Platforms

Abstract: Traditional HPC, and by extension distributed parallel computing platforms, are largely designed for the needs of relatively strictly structured data and computationally demanding programs. This has consequences for HPDA and other emerging uses of these platforms and architectures, as their data is often unstructured; indeed, in the traditional HPC sense it is not even merely unstructured, but something still harder to handle from an architectural point of view. I will discuss some of the advantages and disadvantages of such distributed platforms from the point of view of data-intensive, low-computational-intensity workloads on unstructured data.

Bio: Juha currently manages the COSMOS Intel Parallel Computing Center and the COSMOS supercomputers at the Stephen Hawking Centre for Theoretical Cosmology (CTC). His research interests lie in theoretical physics and in enabling Big Data scientists to get the most out of their HPC/HPDA systems using portable, sustainable, efficient, and optimised software. He has a PhD in theoretical physics and has previously worked on mathematical physics, classical field theories, and large-scale HPC simulations.

Christophe Dubach (University of Edinburgh)

Lift: a Data-Parallel Language for High-Performance Parallel Pattern Code Generation

Abstract: Algorithmic patterns have emerged as a solution for exploiting parallel hardware and achieving performance portability. Applications can easily be expressed at a high level, hiding hardware complexity from programmers and shifting the responsibility to the library writer or compiler. However, producing efficient implementations remains a complicated task that needs to be repeated for each high-level pattern and whenever the hardware changes.

In this talk, I will present Lift, a novel high-level data-parallel programming model. The language is based on a surprisingly small set of functional "elementary" primitives which can be combined to define higher-level algorithmic patterns. A system of rewrite-rules is used to derive device-specific optimised low-level implementations of the algorithmic patterns. The rules encode both algorithmic choices and low-level optimisations in a unified system and let the compiler explore the optimisation space automatically. Preliminary results show this approach produces GPU code that matches the performance of highly tuned implementations of several computational kernels including linear algebra operations.
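
To make the idea of composition from a small set of elementary primitives concrete, here is a rough C++ analogue (this is not Lift's syntax, which is a functional language whose compiler rewrites such compositions into device-specific code): a higher-level pattern, the dot product, is assembled from two small primitives, an element-wise zip-map and a reduction.

    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Elementary primitive: element-wise combination of two vectors ("zip + map").
    std::vector<float> zip_map(const std::vector<float>& a, const std::vector<float>& b,
                               const std::function<float(float, float)>& f) {
        std::vector<float> out(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) out[i] = f(a[i], b[i]);
        return out;
    }

    // Elementary primitive: reduction over a vector.
    float reduce(const std::vector<float>& v, float init,
                 const std::function<float(float, float)>& f) {
        return std::accumulate(v.begin(), v.end(), init, f);
    }

    int main() {
        std::vector<float> a = {1, 2, 3, 4}, b = {5, 6, 7, 8};
        // Higher-level pattern (dot product) obtained by composing the primitives.
        float dot = reduce(zip_map(a, b, std::multiplies<float>()), 0.0f, std::plus<float>());
        std::cout << dot << "\n";  // prints 70
        return 0;
    }

In Lift, an equivalent composition would be expressed functionally and then specialised to a target device by applying rewrite rules; the sketch above only illustrates the compositional style.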

Bio: Christophe Dubach received his PhD in Informatics from the University of Edinburgh in 2009 and holds an M.Sc. degree in Computer Science from EPFL (Switzerland). He is a Lecturer (Assistant Professor) in the Institute for Computing Systems Architecture at the University of Edinburgh (UK). In 2010 he spent one year as a visiting researcher at the IBM Watson Research Center (USA), working on the LiquidMetal project.

His current research interests include high-level programming models for heterogeneous systems, co-design of computer architectures and optimising compiler technology, adaptive microprocessors, and the application of machine learning in these areas.

Daniel Goodman (Oracle Labs)

Fine-grained parallel work scheduling in scale-up graph analytics

Abstract: We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines. It supports very fine-grained scheduling of parallel loops, down to batches of work of around 1K cycles. Fine-grained scheduling helps avoid load imbalance while reducing the need to tune workloads to particular machines or inputs. We use per-core iteration counts to distribute work initially, and a new asynchronous request-combining technique for when threads require more work. We present results using graph analytics algorithms on a 2-socket Intel 64 machine (32 h/w contexts) and on multi-socket SPARC machines (256 h/w contexts per socket). In addition to reducing the need for tuning, on the SPARC machines we improve absolute performance by up to 39% (compared with OpenMP). On both architectures Callisto-RTS provides improved scaling and performance compared with a state-of-the-art parallel runtime system (Galois).
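
As a rough illustration of the scheduling idea (this is not the Callisto-RTS API, and it omits both the per-core initial distribution and the asynchronous request-combining protocol described above), the C++ sketch below hands out small batches of loop iterations from a shared atomic counter, so threads stay balanced even when per-iteration costs vary.

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Run body(i) for i in [0, n): threads repeatedly claim small batches of
    // iterations from a shared atomic counter, so work stays balanced without
    // tuning chunk sizes per machine or per input.
    void parallel_for(std::size_t n, std::size_t batch, unsigned nthreads,
                      const std::function<void(std::size_t)>& body) {
        std::atomic<std::size_t> next{0};
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < nthreads; ++t) {
            workers.emplace_back([&] {
                for (;;) {
                    std::size_t begin = next.fetch_add(batch);  // claim one small batch
                    if (begin >= n) break;
                    std::size_t end = std::min(begin + batch, n);
                    for (std::size_t i = begin; i < end; ++i) body(i);
                }
            });
        }
        for (auto& w : workers) w.join();
    }

    int main() {
        std::atomic<long> sum{0};
        unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
        parallel_for(1000000, 256, nthreads, [&](std::size_t i) {
            sum.fetch_add(static_cast<long>(i % 7), std::memory_order_relaxed);
        });
        std::cout << "sum = " << sum.load() << "\n";
        return 0;
    }

The batch size plays the role of the ~1K-cycle work units mentioned in the abstract: small enough to balance irregular work, large enough to keep the cost of claiming work low.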

Bio: Prior to joining Oracle in 2014, Daniel Goodman worked on a range of projects in academia and industry. His wider research interests are novel computation models and user-friendly programming models. Most recently he worked on the TeraFlux project, looking at programming models that combine dataflow and transactional memory as a means of handling large thread counts in unstructured problems. Within this project he produced a suite of software transactional memories for Scala, Manchester University Transactions for Scala (MUTS), and a Scala-based dataflow library, DFScala. He previously held a Junior Research Fellowship at the Oxford e-Research Centre, Oxford University, where his work focused on constructing tools and programming constructs/models that made high-performance computing more accessible to application scientists in areas ranging from astronomy to medical imaging to simulating the visual cortex.

Aleksandar Dragojevic (Microsoft Research Cambridge)

FaRM: a platform for low-latency computing

Abstract: FaRM is a new main memory distributed computing platform that exploits RDMA communication to improve both latency and throughput by an order of magnitude relative to state of the art main memory systems that use TCP/IP. FaRM exposes the memory of machines in the cluster as a shared address space. Applications execute distributed transactions that allocate, read, write, and free objects in the address space. We have recently started building a graph store, called A1, on top of FaRM to support real-time data analytics. A1 relies on FaRM transactions to keep data consistent and performs traversals on the latest snapshot of data to ensure results are current.
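
The snippet below is a toy, single-process sketch of the programming model described above: a transaction that allocates, reads and writes objects in an address space and then commits. The class and method names are hypothetical and are not FaRM's API; nothing here is distributed, replicated, or RDMA-based.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Toy transaction over an in-process "address space": writes are buffered
    // and only become visible to the store on Commit(). Names are hypothetical.
    class Transaction {
    public:
        explicit Transaction(std::unordered_map<std::uint64_t, std::int64_t>& store)
            : store_(store) {}

        std::uint64_t Alloc() { return next_addr_++; }            // allocate an object address

        std::int64_t Read(std::uint64_t addr) {                   // reads see this tx's own writes
            auto it = writes_.find(addr);
            return it != writes_.end() ? it->second : store_[addr];
        }

        void Write(std::uint64_t addr, std::int64_t value) { writes_[addr] = value; }

        void Commit() {                                           // apply buffered writes
            for (const auto& w : writes_) store_[w.first] = w.second;
        }

    private:
        std::unordered_map<std::uint64_t, std::int64_t>& store_;
        std::unordered_map<std::uint64_t, std::int64_t> writes_;
        static std::uint64_t next_addr_;
    };

    std::uint64_t Transaction::next_addr_ = 1;

    int main() {
        std::unordered_map<std::uint64_t, std::int64_t> store;  // stand-in for the shared address space
        Transaction tx(store);
        std::uint64_t obj = tx.Alloc();
        tx.Write(obj, 42);
        std::cout << tx.Read(obj) << "\n";  // 42: visible inside the transaction
        tx.Commit();
        std::cout << store[obj] << "\n";    // 42: visible in the store after commit
        return 0;
    }

In the real system the address space spans the memory of many machines, and reads, writes, and commits translate into one-sided RDMA operations; the sketch shows only the shape of the transactional interface.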

Bio: I work as a researcher in the Systems and Networking group at MSR Cambridge. My main interest is building scalable parallel and distributed systems. I prefer staying on the practical side of problems and enjoy implementing real systems, ranging from simple script-based solutions to everyday problems to complex production systems. Lately I have been excited about several emerging hardware trends: the large main memories of modern machines, very fast networks and interconnects, and the increasingly high integration of components on chips. I am mostly interested in how these hardware developments impact software, as we need to change the software stack, from the operating and runtime systems to the applications, in order to make the best use of them. I received my PhD from EPFL, Switzerland, in 2012, where I worked on the performance of software transactional memory. Before that, I received my Graduate Electrical Engineer diploma from the University of Novi Sad, Serbia, in 2004.

Sam Ainsworth (University of Cambridge)

Graph Prefetching Using Data Structure Knowledge

Abstract: Searches on large graphs are heavily memory-latency bound as a result of many high-latency DRAM accesses. Due to the highly irregular nature of the access patterns involved, caches and prefetchers, both hardware and software, perform poorly on graph workloads, leaving the CPU stalled for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, with most accesses falling into a small set of simple patterns. Although existing implicit prefetchers bring little benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate an application's traversals and bring in the appropriate data in a latency-tolerant manner.

We present the design of an explicitly configured hardware prefetcher to improve performance for breadth-first searches and sequential iteration on the efficient and commonly-used compressed sparse row graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3x, and up to 3.3x. We further show how the idea scales to other areas of interest to the big data community, such as database traversals.
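
For readers unfamiliar with the compressed sparse row format, the sketch below shows a plain breadth-first search over a small CSR graph. The point is that the traversal's future memory accesses (the row offsets, the edge slice, and the visited flags of the neighbours) are determined by the vertices already sitting in the work queue, which is exactly the data-structure knowledge a configured prefetcher can exploit. This is a generic illustration, not the hardware design presented in the talk.

    #include <cstdint>
    #include <iostream>
    #include <queue>
    #include <vector>

    int main() {
        // CSR for a small directed graph: vertex v's out-edges are col[row[v] .. row[v+1]).
        std::vector<std::uint32_t> row = {0, 2, 4, 5, 5};  // |V|+1 offsets
        std::vector<std::uint32_t> col = {1, 2, 2, 3, 3};  // |E| neighbour ids

        std::vector<bool> visited(row.size() - 1, false);
        std::queue<std::uint32_t> work;
        work.push(0);
        visited[0] = true;

        // For a vertex v near the head of the queue, the addresses of row[v],
        // row[v+1], the slice col[row[v]..row[v+1]) and the corresponding visited
        // flags are all predictable well before v is dequeued.
        while (!work.empty()) {
            std::uint32_t v = work.front();
            work.pop();
            std::cout << "visit " << v << "\n";
            for (std::uint32_t e = row[v]; e < row[v + 1]; ++e) {  // scan v's edge slice
                std::uint32_t u = col[e];
                if (!visited[u]) {
                    visited[u] = true;
                    work.push(u);
                }
            }
        }
        return 0;
    }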

Bio: Sam is a second-year PhD student at the Computer Laboratory, University of Cambridge, in the Computer Architecture group. His research looks at architectural and compiler techniques for data prefetching, both in software and in hardware, particularly for irregular and big data workloads.

Rajeev Raman (University of Leicester)

In-memory processing of big data via succinct data structures

Abstract: We often use "Big Data" techniques when most users only have "big data". Such "big data" can often be handled efficiently by applying standard, tried-and-tested algorithms coupled with succinct data structures that reduce their memory usage, allowing the "big data" to be processed in memory. I will introduce succinct data structures and some recent applications to mining "big data".
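
As a flavour of the building blocks involved (a generic example, not taken from the talk), the C++ sketch below implements a bitvector with rank support: by storing one cumulative popcount per 64-bit word alongside the raw bits, rank queries are answered in constant time with only a small constant-factor space overhead. Rank/select bitvectors of this kind underpin many succinct representations of trees, strings and graphs.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Bitvector with O(1) rank support: alongside the raw bits we keep one
    // cumulative popcount per 64-bit word (uses the GCC/Clang popcount builtin).
    struct RankBitvector {
        std::vector<std::uint64_t> words;   // the bits, packed 64 per word
        std::vector<std::uint64_t> blocks;  // blocks[w] = number of 1s in words[0..w)

        explicit RankBitvector(const std::vector<bool>& bits)
            : words((bits.size() + 63) / 64, 0), blocks(words.size() + 1, 0) {
            for (std::size_t i = 0; i < bits.size(); ++i)
                if (bits[i]) words[i / 64] |= std::uint64_t(1) << (i % 64);
            for (std::size_t w = 0; w < words.size(); ++w)
                blocks[w + 1] = blocks[w] + __builtin_popcountll(words[w]);
        }

        // rank(i): number of 1 bits among positions [0, i).
        std::uint64_t rank(std::size_t i) const {
            std::uint64_t r = blocks[i / 64];
            if (i % 64)
                r += __builtin_popcountll(words[i / 64] & ((std::uint64_t(1) << (i % 64)) - 1));
            return r;
        }
    };

    int main() {
        RankBitvector bv({1, 0, 1, 1, 0, 0, 1, 0});
        std::cout << bv.rank(4) << "\n";  // 3 ones among the first four bits
        return 0;
    }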

Bio: Rajeev Raman received his PhD from the University of Rochester, Rochester, NY in 1991. He was a postdoc at the Max-Planck-Institut für Informatik and at the University of Maryland Institute for Advanced Computer Studies before joining King's College London as a Lecturer in 1994. He joined Leicester as a Professor in 2001, where he has been ever since, and served as Head of Department from 2003 to 2006. He has been a visitor at the Max-Planck-Institut, HKUST, the Courant Institute at NYU, Rutgers University, Kyoto University and Melbourne. His research interests include data structures (particularly succinct data structures), parallel algorithms and data mining.

Alexandros Koliousis (Imperial College London)

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

Abstract: Modern servers have become heterogeneous, often combining multi-core CPUs with many-core GPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism to fully utilise all available heterogeneous processors, and decide how to use each in the most effective way. It must do this while respecting the semantics of streaming SQL queries, in particular with regard to window handling. In this talk, I will present SABER, a hybrid high-performance relational stream processing engine for CPUs and GPUs. SABER executes window-based streaming SQL queries in a data-parallel fashion using all available CPU and GPU cores. Instead of statically assigning query operators to heterogeneous processors, SABER employs a new adaptive heterogeneous look-ahead scheduling strategy, which increases the share of operators executing on the processor that yields the highest performance. To hide data movement costs, SABER pipelines the transfer of stream data between CPU and GPU memory. Our experimental comparison against state-of-the-art engines shows that SABER increases processing throughput while maintaining low latency for a wide range of streaming SQL queries with both small and large window sizes.
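
As background on the window semantics involved (a single-threaded sketch, not SABER's engine), the snippet below computes a count-based sliding-window aggregate, with window size 4 and slide 2, over a stream of tuples. SABER's contribution is executing operators with exactly these semantics in a data-parallel fashion across CPU and GPU cores; the sketch shows only the semantics, not the hybrid look-ahead scheduling or the data movement.

    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        const std::size_t window = 4, slide = 2;            // count-based window definition
        std::vector<int> stream = {3, 1, 4, 1, 5, 9, 2, 6};

        std::deque<int> buf;    // tuples of the current window
        std::size_t count = 0;  // tuples seen so far
        for (int tuple : stream) {
            buf.push_back(tuple);
            if (buf.size() > window) buf.pop_front();       // evict tuples that slid out
            ++count;
            // A result is emitted for the first full window and then every `slide` tuples.
            if (count >= window && (count - window) % slide == 0) {
                int sum = std::accumulate(buf.begin(), buf.end(), 0);
                std::cout << "window ending at tuple " << count << ": sum = " << sum << "\n";
            }
        }
        return 0;
    }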

Bio: Alexandros Koliousis is a Research Associate at the Department of Computing, Imperial College London, where he is a member of the Large-Scale Distributed Systems group. His research is currently focused on engineering data-parallel processing systems that make the best use of heterogeneous processors. Prior to joining Imperial College, he worked on the unification of publish/subscribe systems and streaming databases for complex event processing at the University of Glasgow, where he also obtained his MSc and PhD degrees.

Valentin Dalibard (University of Cambridge)

Modern Systems for Neural Networks

Abstract: In this talk, I'll describe the current landscape of software available to train Deep Neural Networks (DNNs) in local and distributed settings, and go over some of their current trends. I will also discuss the outcomes of a recent project in which we used Bayesian Optimization to tune the distributed scheduling of DNNs on heterogeneous hardware.

Bio: Valentin Dalibard is a fourth year PhD student in the Systems Research Group at the University of Cambridge, under the supervision of Dr Eiko Yoneki. He is interested in applying modern statistical models to computations, such as the ones performed by distributed systems, and using these models to make better static and runtime decisions.