09:00 - 09:30 Coffee & Biscuits
09:30 - 09:45 Opening (slides)
09:45 - 10:30 Applications (High Performance Data Analytics)
09:45 - 10:15 Juha Jaykka (Univ. Cambridge COSMOS Intel Parallel Computing Centre): Advantages and disadvantages of modern parallel computing (slides)
10:15 - 10:30 Discussion
10:30 - 11:15 System Architecture I (HPC)
10:30 - 11:00 Christophe Dubach (Univ. Edinburgh): Lift: a Data-Parallel Language for High-Performance Parallel Pattern Code Generation (slides)
11:00 - 11:15 Discussion
11:15 - 11:30 Coffee Break
11:30 - 12:15 System Architecture II (Scheduling)
11:30 - 12:00 Daniel Goodman (Oracle Labs): Fine-grained parallel work scheduling in scale-up graph analytics (slides)
12:00 - 12:15 Discussion
12:15 - 13:00 Lunch
13:00 - 14:30 System Architecture III (Distributed Computing/Memory Management)
13:00 - 13:30 Aleksandar Dragojevic (Microsoft Research Cambridge): FaRM: a platform for low-latency computing (slides)
13:30 - 13:45 Sam Ainsworth (Univ. Cambridge): Graph Prefetching Using Data Structure Knowledge (slides)
13:45 - 14:15 Rajeev Raman (Univ. Leicester): In-memory processing of big data via succinct data structures (slides)
14:15 - 14:30 Discussion
14:30 - 14:45 Coffee Break
14:45 - 16:00 Heterogeneous Cores (Optimisation in Stream Processing and Neural Networks)
14:45 - 15:10 Alexandros Koliousis (Imperial College London): SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures (slides)
15:10 - 15:30 Valentin Dalibard (Univ. Cambridge): Modern Systems for Neural Networks (slides)
15:30 - 16:00 Discussion
16:00 - 16:30 Discussion of the future vision of the data processing stack, from hardware and low-level programming through parallel programming platforms to applications
16:30 - 17:30 Closing + wine
Talk Abstracts
Juha Jaykka (University of Cambridge COSMOS Intel Parallel Computing Centre)
High Performance Data Analytics on Distributed Parallel Platforms
Abstract: Traditional HPC systems, and by extension distributed parallel computing platforms, are largely designed for the needs of relatively strictly structured data and computationally demanding programs. This has consequences for HPDA and other emerging uses of these platforms and architectures, as the data is often unstructured; in the traditional HPC sense it is not even unstructured, but something still harder to handle from an architectural point of view. I will discuss some of the advantages and disadvantages of such distributed platforms from the point of view of the requirements of data-intensive, low-computational-intensity work on unstructured data.
Bio: Juha currently manages the COSMOS Intel Parallel Computing Center and the COSMOS supercomputers at the Stephen Hawking Centre for Theoretical Cosmology (CTC). His research interests lie in theoretical physics and in enabling Big Data scientists to get the most out of their HPC/HPDA systems using portable, sustainable, efficient, and optimised software. He has a PhD in theoretical physics and has previously worked on mathematical physics, classical field theories, and large-scale HPC simulations.
Christophe Dubach (University of Edinburgh)
Lift: a Data-Parallel Language for
High-Performance Parallel Pattern Code Generation
Abstract: Algorithmic patterns have
emerged as a solution to exploit parallel hardware and achieve
performance portability. Applications can easily be expressed at a
high level, hiding hardware complexity from programmers and
shifting the responsibility to the library writer or compiler.
However, producing efficient implementations remains a complicated
task that needs to be repeated for each high-level pattern and
whenever hardware changes.
In this talk, I will present Lift, a
novel high-level data-parallel programming model. The language is
based on a surprisingly small set of functional "elementary"
primitives which can be combined to define higher-level algorithmic
patterns. A system of rewrite rules is used to derive
device-specific optimised low-level implementations of the
algorithmic patterns. The rules encode both algorithmic choices and
low-level optimisations in a unified system and let the compiler
explore the optimisation space automatically. Preliminary results
show this approach produces GPU code that matches the performance of
highly tuned implementations of several computational kernels
including linear algebra operations.
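To make the rewrite-rule idea concrete, here is a minimal sketch in Python (with invented names; Lift is a separate language and this is not its API) of how a rule such as map fusion can rewrite a composition of two elementary primitives into a single, cheaper traversal without changing its meaning:

    # Hypothetical mini-model of pattern rewriting; not Lift's actual API.
    def compose(f, g):
        """Function composition: (f . g)(x) = f(g(x))."""
        return lambda x: f(g(x))

    def map_p(f):
        """The 'map' elementary primitive, represented as a pattern."""
        return ("map", f)

    def run(pattern, xs):
        kind, f = pattern
        assert kind == "map"
        return [f(x) for x in xs]

    def fuse_maps(outer, inner):
        """Rewrite rule: map(f) o map(g)  ==>  map(f o g)."""
        (_, f), (_, g) = outer, inner
        return map_p(compose(f, g))

    xs = [1, 2, 3, 4]
    # Two traversals of the data...
    two_pass = run(map_p(lambda x: x + 1), run(map_p(lambda x: x * 2), xs))
    # ...rewritten into one fused traversal with identical results.
    one_pass = run(fuse_maps(map_p(lambda x: x + 1),
                             map_p(lambda x: x * 2)), xs)
    assert two_pass == one_pass == [3, 5, 7, 9]

Because such rules preserve semantics, the compiler is free to apply them speculatively and search the resulting space of implementations for the fastest one on a given device.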
Bio: Christophe Dubach received his Ph.D. in Informatics from the University of Edinburgh in 2009 and holds an M.Sc. degree in Computer Science from EPFL (Switzerland). He
is a Lecturer (Assistant Professor) in the Institute for Computing
Systems Architecture at the University of Edinburgh (UK). In 2010 he
spent one year as a visiting researcher at the IBM Watson Research
Center (USA) working on the LiquidMetal project.
His current research interests include high-level programming models for heterogeneous systems, co-design of both computer architecture and optimising compiler technology, adaptive microprocessors, and the application of machine learning in these areas.
Daniel Goodman (Oracle Labs)
Fine-grained parallel work scheduling
in scale-up graph analytics
Abstract: We introduce Callisto-RTS, a
parallel runtime system designed for multi-socket shared-memory
machines. It supports very fine-grained scheduling of parallel
loops—down to batches of work of around 1K cycles. Fine-grained
scheduling helps avoid load imbalance while reducing the need for
tuning workloads to particular machines or inputs. We use per-core
iteration counts to distribute work initially, and a new
asynchronous request combining technique for when threads require
more work. We present results using graph analytics algorithms on a
2-socket Intel 64 machine (32 h/w contexts), and on multi-socket
SPARC machines (256 h/w contexts per socket). In addition to
reducing the need for tuning, on the SPARC machines we improve
absolute performance by up to 39% (compared with OpenMP). On both
architectures Callisto-RTS provides improved scaling and performance
compared with a state-of-the-art parallel runtime system (Galois).
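As a rough illustration of the scheduling idea only (not Callisto-RTS itself; the real system uses per-core iteration counts and asynchronous request combining rather than the single locked counter below), a parallel loop whose workers claim small batches of iterations on demand might look like this:

    import threading

    def parallel_for(n_iters, body, n_workers=4, batch=64):
        """Run body(i) for 0 <= i < n_iters, claiming small batches of work."""
        cursor = [0]
        lock = threading.Lock()  # stand-in for async request combining

        def claim():
            with lock:
                start = cursor[0]
                cursor[0] = min(start + batch, n_iters)
                return start, cursor[0]

        def worker():
            while True:
                start, end = claim()
                if start >= end:
                    return  # no work left
                for i in range(start, end):
                    body(i)

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Example: out-degree counting, a typical fine-grained graph loop.
    edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
    degree = [0, 0, 0]
    count_lock = threading.Lock()

    def count(i):
        with count_lock:
            degree[edges[i][0]] += 1

    parallel_for(len(edges), count, batch=1)
    assert degree == [2, 1, 1]

Smaller batches give better load balance at the cost of more scheduling traffic; the contribution of the talk is making that traffic cheap enough for batches of around 1K cycles.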
Bio: Prior to joining Oracle in 2014, Daniel Goodman worked on a range of projects in academia and industry. His wider research interests are novel computation models and user-friendly programming models. Most recently he worked on the TeraFlux project, looking at programming models that combine dataflow and transactional memory as a means of handling large thread counts in unstructured problems. Within this project he produced a suite of software transactional memories for Scala, Manchester University Transactions for Scala (MUTS), and a Scala-based dataflow library, DFScala. He previously held a Junior Research Fellowship at the Oxford e-Research Centre, Oxford University, where his work focused on constructing tools and programming constructs/models that made high performance computing more accessible to application scientists in areas ranging from astronomy to medical imaging to simulating the visual cortex.
Aleksandar Dragojevic (Microsoft Research
Cambridge)
FaRM: a platform for low-latency
computing
Abstract: FaRM is a new main-memory distributed computing platform that exploits RDMA communication to improve both latency and throughput by an order of magnitude relative to state-of-the-art main-memory systems that use TCP/IP.
FaRM exposes the memory of machines in the cluster as a shared
address space. Applications execute distributed transactions that
allocate, read, write, and free objects in the address space. We
have recently started building a graph store, called A1, on top of
FaRM to support real-time data analytics. A1 relies on FaRM
transactions to keep data consistent and performs traversals on the
latest snapshot of data to ensure results are current.
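To sketch what this programming model looks like (with invented class names; this is not FaRM's actual interface, and a single dict stands in for the memory of the whole cluster), a transaction that allocates, reads, and writes objects with optimistic validation at commit might be:

    # Toy model of transactions over a shared address space.
    class Store:
        def __init__(self):
            self.objects = {}   # address -> object
            self.version = {}   # address -> commit count
            self.next_addr = 0

    class Tx:
        """Optimistic transaction: buffer writes, validate reads at commit."""
        def __init__(self, store):
            self.s, self.reads, self.writes = store, {}, {}

        def alloc(self, value):
            addr = self.s.next_addr
            self.s.next_addr += 1
            self.writes[addr] = value
            return addr

        def read(self, addr):
            if addr in self.writes:
                return self.writes[addr]
            self.reads[addr] = self.s.version.get(addr, 0)
            return self.s.objects[addr]

        def write(self, addr, value):
            self.writes[addr] = value

        def commit(self):
            # Abort if any object we read has been updated since.
            for addr, seen in self.reads.items():
                if self.s.version.get(addr, 0) != seen:
                    return False
            for addr, value in self.writes.items():
                self.s.objects[addr] = value
                self.s.version[addr] = self.s.version.get(addr, 0) + 1
            return True

    store = Store()
    tx = Tx(store)
    a = tx.alloc({"name": "node-0", "edges": []})
    assert tx.commit()

    tx2 = Tx(store)
    tx2.write(a, dict(tx2.read(a), edges=[1]))
    assert tx2.commit() and store.objects[a]["edges"] == [1]

In FaRM the reads and writes go over RDMA to remote machines' memory, which is what makes this transactional style viable at low latency.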
Bio: I work as a researcher in the Systems and Networking group at MSR Cambridge. My main interest is building
scalable parallel and distributed systems. I prefer staying on the
practical side of problems and enjoy implementing real systems
ranging from simple script-based solutions to everyday problems to
complex production systems. Lately I have been really excited about
several emerging hardware trends: large main memories of modern
machines, very fast networks and interconnects, and increasingly
higher integration of components on chips. I am mostly interested in
how these novel hardware developments impact software, as we need to
change the software stacks, from the operating and runtime systems
to the applications, in order to best utilize them. I received my
PhD from EPFL Switzerland in 2012, where I worked on performance of
software transactional memory. Before that, I received my Graduate Electrical Engineer diploma from the University of Novi Sad, Serbia, in 2004.
Sam Ainsworth (University of Cambridge)
Graph Prefetching Using Data Structure
Knowledge
Abstract: Searches on large graphs are
heavily memory latency bound, as a result of many high latency DRAM
accesses. Due to the highly irregular nature of the access patterns
involved, caches and prefetchers, both hardware and software,
perform poorly on graph workloads. This leaves the CPU stalled for
the majority of the time. However, in many cases the data access
pattern is well defined and predictable in advance, often falling
into a small set of simple patterns. Although existing implicit
prefetchers cannot bring significant benefit, a prefetcher armed
with knowledge of the data structures and access patterns could
accurately anticipate applications' traversals to bring in the
appropriate data in a latency tolerant manner.
We present the design of an explicitly
configured hardware prefetcher to improve performance for
breadth-first searches and sequential iteration on the efficient and
commonly-used compressed sparse row graph format. By snooping L1
cache accesses from the core and reacting to data returned from its
own prefetches, the prefetcher can schedule timely loads of data in
advance of the application needing it. For a range of applications
and graph sizes, our prefetcher achieves average speedups of 2.3x,
and up to 3.3x. We further show how the idea scales to other areas
of interest to the big data community, such as database traversals.
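The predictability the abstract relies on is easy to see in code. In the hypothetical Python sketch below, the vertices a breadth-first search will visit shortly are already sitting in its work queue, so a prefetcher that understands the CSR layout can fetch their row ranges a fixed distance ahead; prefetch() here is a no-op stand-in for the proposed hardware unit:

    from collections import deque

    def prefetch(_):
        pass  # placeholder: real hardware would issue a memory load here

    def bfs_csr(row_ptr, col_idx, src, lookahead=4):
        """BFS over a CSR graph, prefetching `lookahead` entries ahead."""
        n = len(row_ptr) - 1
        visited = [False] * n
        visited[src] = True
        queue = deque([src])
        order = []
        while queue:
            # The access pattern is known: queue[lookahead] is needed soon,
            # so its row_ptr entry (and then its edge list) can be fetched.
            if len(queue) > lookahead:
                prefetch(row_ptr[queue[lookahead]])
            v = queue.popleft()
            order.append(v)
            for e in range(row_ptr[v], row_ptr[v + 1]):
                w = col_idx[e]
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
        return order

    # CSR encoding of the edges 0->1, 0->2, 1->2, 2->3.
    row_ptr = [0, 2, 3, 4, 4]
    col_idx = [1, 2, 2, 3]
    assert bfs_csr(row_ptr, col_idx, 0) == [0, 1, 2, 3]

A generic stride prefetcher sees only the irregular col_idx accesses and fails; a prefetcher told "this is a BFS over CSR" can chain from the queue to row_ptr to col_idx to the vertex data itself.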
Bio: Sam is a second-year PhD student
at the Computer Laboratory, University of Cambridge, in the Computer
Architecture group. His research looks at architectural and compiler
techniques for data prefetching, both in software and in hardware,
particularly for irregular and big data workloads.
Rajeev Raman
(University of Leicester)
In-memory processing of big data via succinct data structures
Abstract: We often use "Big Data" techniques when most users only have "big data". Such "big data" can often be handled efficiently by applying standard, tried-and-tested algorithms coupled with succinct data structures, which reduce the memory usage of those algorithms and thus allow the "big data" to be processed in memory. I will introduce succinct data structures and some recent applications to mining "big data".
Bio: Rajeev Raman defended his PhD at the University of Rochester, Rochester, NY in 1991. He was a postdoc at the Max-Planck-Institut für Informatik and at the University of Maryland Institute for Advanced Computer Studies before joining King's College London as a Lecturer in 1994. He joined Leicester as a Professor in 2001, where he has been ever since. He served as Head of Department from 2003 to 2006.
He has been a visitor at the Max-Planck-Institut, HKUST, the Courant
Institute NYU, Rutgers University, Kyoto University and Melbourne.
His research interests include data structures (particularly
succinct data structures), parallel algorithms and data mining.
Alexandros Koliousis
(Imperial College London)
SABER: Window-Based Hybrid Stream
Processing for Heterogeneous Architectures
Abstract: Modern servers have become
heterogeneous, often combining multi-core CPUs with many-core GPUs.
Such heterogeneous architectures have the potential to improve the
performance of data-intensive stream processing applications, but
they are not supported by current relational stream processing
engines. For an engine to exploit a heterogeneous architecture, it
must execute streaming SQL queries with sufficient data-parallelism
to fully utilise all available heterogeneous processors, and decide
how to use each in the most effective way. It must do this while
respecting the semantics of streaming SQL queries, in particular
with regard to window handling. In this talk, I will present SABER,
a hybrid high-performance relational stream processing engine for
CPUs and GPUs. SABER executes window-based streaming SQL queries in
a data-parallel fashion using all available CPU and GPU cores.
Instead of statically assigning query operators to heterogeneous
processors, SABER employs a new adaptive heterogeneous look-ahead
scheduling strategy, which increases the share of operators
executing on the processor that yields the highest performance. To
hide data movement costs, SABER pipelines the transfer of stream
data between CPU and GPU memory. Our experimental comparison against
state-of-the-art engines shows that SABER increases processing
throughput while maintaining low latency for a wide range of
streaming SQL queries with both small and large window sizes.
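A rough sketch of the adaptive flavour of such scheduling (illustrative only; SABER's actual look-ahead strategy is more sophisticated than this throughput-tracking loop) might route each batch of windows to whichever processor has recently delivered the best rate for an operator:

    import random

    class AdaptiveScheduler:
        """Send work to the faster processor, occasionally probing the other."""
        def __init__(self, processors=("cpu", "gpu")):
            self.rate = {p: 1.0 for p in processors}  # tasks per second

        def pick(self):
            if random.random() < 0.1:  # keep both estimates fresh
                return random.choice(list(self.rate))
            return max(self.rate, key=self.rate.get)

        def record(self, proc, tasks, seconds, alpha=0.2):
            # Exponentially weighted moving average of observed throughput.
            self.rate[proc] += alpha * (tasks / seconds - self.rate[proc])

    sched = AdaptiveScheduler()
    sched.record("gpu", tasks=1000, seconds=0.5)  # the GPU ran a batch fast
    sched.record("cpu", tasks=1000, seconds=2.0)  # the CPU was slower
    print(sched.pick())                           # now usually "gpu"

The point of adapting at runtime rather than assigning operators statically is that the best processor for an operator depends on the query, the window size, and the data rate, none of which are known in advance.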
Bio: Alexandros Koliousis is a Research
Associate at the Department of Computing, Imperial College London.
He is a member of the Large-Scale Distributed Systems group. His
research is currently focused on engineering data parallel
processing systems that make best use of heterogeneous processors.
Prior to joining Imperial College, he worked on the unification of publish/subscribe systems and streaming databases for complex event processing at the University of Glasgow, where he also obtained his MSc and PhD degrees.
Valentin Dalibard
(University of Cambridge)
Modern Systems for Neural Networks
Abstract: In this talk, I'll describe
the current landscape of software available to train Deep Neural
Networks (DNNs) in local and distributed settings, and go over some
of their current trends. I will also discuss the outcomes of a
recent project in which we used Bayesian Optimization to tune the
distributed scheduling of DNNs on heterogeneous hardware.
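For readers unfamiliar with the technique, the shape of such a tuning loop is roughly as follows. This is a generic propose-measure-update sketch with a made-up objective, and the crude local search stands in for the Gaussian-process surrogate and acquisition function a real Bayesian optimiser would use:

    import random

    def measure_step_time(batch_split):
        # Hypothetical objective: runtime of one training step when
        # batch_split of the work goes to the faster device (the sweet
        # spot is placed at 0.7 purely for this example).
        return (batch_split - 0.7) ** 2 + random.gauss(0, 0.001)

    observations = []  # (parameter, measured runtime)
    for trial in range(30):
        if len(observations) < 5:
            candidate = random.random()  # explore at first
        else:
            # Crude surrogate: sample near the best point seen so far.
            best, _ = min(observations, key=lambda o: o[1])
            candidate = min(1.0, max(0.0, best + random.gauss(0, 0.1)))
        observations.append((candidate, measure_step_time(candidate)))

    best_split, best_time = min(observations, key=lambda o: o[1])
    print(f"best split ~ {best_split:.2f} (runtime {best_time:.4f})")

The appeal for systems tuning is that each measurement (a real training step) is expensive, and Bayesian optimisation is designed to find good configurations in few evaluations.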
Bio: Valentin Dalibard is a fourth-year PhD student in the Systems Research Group at the University of
Cambridge, under the supervision of Dr Eiko Yoneki. He is interested
in applying modern statistical models to computations, such as the
ones performed by distributed systems, and using these models to
make better static and runtime decisions.