Computer Laboratory

Research Overview

My research focuses around extracting multiple forms of parallelism from applications through code transformations and architectural enhancements. Here I give an overview of my current and previous work. See my group page for details on my research group.

Automatic Parallelisation

HELIX logo With the advent of multi and many-core processors, there is a growing need for tools that can extract parallelism from general sequential codes. Automatic parallelisation is a seductive route to achieving this, since it removes the burden of identifying and exploiting parallelism from the programmer. However, automatically parallelising ordinary, irregular, single-threaded codes and obtaining significant speedups has historically been an extremely difficult challenge.

My most notable research in this direction has been on the HELIX parallelising compiler. HELIX is a fully automatic, non-speculative loop parallelisation technique for generic sequential code. It identifies the most profitable loops to parallelise, taking account of the cross-iteration data dependences that they contain, as well as the amount of work that they perform. We developed the HELIX compiler and architectural support for fast communication between cores (called a ring cache), which allows the whole system to achieve unrivalled speedups for non-numerical (i.e. more irregular) applications.

Our work on HELIX has largely ceased due to the difficulties in getting applications into its source form (HELIX was implemented in a prototype compiler that took in CIL bytecode as source). My research now focuses on techniques to extract loop-level parallelism from application binaries, and to develop compiler schemes that go beyond HELIX by understanding the differences between how compilers and programmers parallelise code.


Prefetching (Memory-Level Parallelism)

Graph prefetcher Many modern workloads for high-performance compute (HPC) and data processing are heavily memory-latency bound, due to the von Neumann bottleneck in conventional multiprocessors. The traditional solution to overcome this has been prefetching: using hardware to detect common access patterns such as strides so as to bring the required data into fast cache memory before it is requested by the processor. However, existing techniques do not work well for irregular access patterns, as seen in linked data structures, and also in indirect memory accesses, where the addresses loaded are based on indices stored in arrays.

My group has developed both hardware and software schemes to aid in prefetching these indirect memory access patterns. Our hardware prefetcher for breadth-first searches on graphs sits alongside the L1 cache and uses data structure knowledge to prefetch accurately far ahead in the computation. Breadth-first search is a highly common compute kernel used in social media analysis, web crawling, model checking and many other fields, as it can be used as the basis of a large number of algorithms. A more general software scheme targets any access pattern using an array for indirection, and is available as an open-source pass for LLVM.



PSLP Single instruction multiple data (SIMD) instruction sets provide energy-efficient performance by taking advantage of data-level parallelism whereby each instruction is applied to a vector of data. In the past, my group has looked at schemes to increase the performance of SLP vectorisation, which is an algorithm to vectorise straight-line code (in contrast to vectorising loops). We achieved this by both adding extra, redundant instructions and by reducing the size of the graphs that the algorithm builds when identifying opportunities for vectorisation. We are now considering techniques to increase the coverage of vectorisation in general, through additional hardware and compiler support for it.



REPAIR Modern day computer systems have benefited from being designed and manufactured using an ever-increasing budget of transistors with very reliable integrated circuits. However, moving forwards such a “free lunch” is over. As transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening system lifespans.

We are tackling this challenge through both hardware and software schemes. For example, COMET is a technique to improve the performance of multi-threaded reliability code that is inserted automatically by the compiler. In addition, we're looking at schemes to take advantage of parallelism to reduce the area overheads of redundancy.


Compiler and Microarchitectural Design

Dynamic adaptation Designing a new microarchitecture, or a compiler for it, is a time-consuming task. The best architectural configuration and compiler optimisations vary depending on the applications run. We investigated the use of machine learning to aid this design space exploration, created a portable optimising compiler and then developed an online scheme to dynamically adjust the microarchitecture to different phases of an application.



Smart cache Our prior work on caching has focused on improving the energy efficiency of the different caches in the system. This has mostly been in hardware, with some compiler work too. We've proposed different ways of mapping data into the cache so as to reduce tag energy during lookup, and a link-time scheme to remove tag checks entirely for some cache accesses.


Photonic Networks

Optical NoC Optical networks-on-chip are a promising future technology to increase the performance of data movement. However, photonic interconnects do not currently allow buffering in the optical domain, so circuits must be set up before use. My work with Philip Watts and Anouk van Laer at UCL tried to address this by predicting when communication was going to occur and setting up the circuits in advance, reducing the latency of messages.


Compiler-Directed Energy Efficiency

Register cache My PhD developed schemes for power and energy saving within an out-of-order superscalar processor using compiler-generated hints. I looked at two power-hungry structures, namely the issue queue and register file, proposing methods to identify when parts of them wouldn't be used so that they could be turned off, realising energy savings.