Department of Computer Science and Technology

Research Overview

My research focuses around extracting multiple forms of parallelism from applications through code transformations and architectural enhancements. Here I give an overview of my current and previous work. See my group page for details on my research group and my publications page for conference, journal and workshop papers.

Arm recently published an article about our collaboration and, co-incidentally, the university also published a piece about the wider collaboration with Arm, of which I am a part.

Security

Locked pink door Security of computer systems and the data that they process has slowly been becoming more important. Then in January 2018 the disclosure of Spectre and Meltdown vulnerabilities rocketed security to a first-order design constraint.

We have done work to tackle a variety of security-related issues, from low-cost Spectre mitigations, to avoiding use-after-free vulnerabilities that manifest in C code (and other manually memory-managed languages) when programmers do not (or the programs are too complicated to allow them to) reason carefully enough about when memory should be freed. MuonTrap is a scheme to isolate speculative state within a small filter cache between the processor core and first level of the cache hierarchy, without introducing additional side channels that could leak information. The Guardian Council is a programmable solution for security monitoring, allowing operating-system-verified kernels to constantly check the execution of code running on a high-performance core. MarkUs is a drop-in replacement for C's memory allocator, providing a quarantine facility when memory is freed in which to hold objects until we can be sure there are no dangling pointers to them. Partnering with the team that developed CHERI, we have built on MarkUs to develop CHERIvoke and Cornucopia, both of which leverage CHERI capabilities to provide fast pointer revocation.

Publications

Prefetching (Memory-Level Parallelism)

Parallel data Many modern workloads for high-performance compute (HPC) and data processing are heavily memory-latency bound, due to the von Neumann bottleneck in conventional multiprocessors. The traditional solution to overcome this has been prefetching: using hardware to detect common access patterns such as strides so as to bring the required data into fast cache memory before it is requested by the processor. However, existing techniques do not work well for irregular access patterns, as seen in linked data structures, and also in indirect memory accesses, where the addresses loaded are based on indices stored in arrays.

My group has developed both hardware and software schemes to aid in prefetching these indirect memory access patterns. Our hardware prefetcher for breadth-first searches on graphs sits alongside the L1 cache and uses data structure knowledge to prefetch accurately far ahead in the computation. Breadth-first search is a highly common compute kernel used in social media analysis, web crawling, model checking and many other fields, as it can be used as the basis of a large number of algorithms. A more general software scheme targets any access pattern using an array for indirection, and is available as an open-source pass for LLVM. Most recently we've look again at runahead execution (carrying on speculative execution when the ROB becomes full) and have developed a scheme to vectorise these future instructions, taking advantage of the data-level parallelism in the processor to gain memory-level parallelism in prefetching.

Publications

Reliability

Addition error Modern day computer systems have benefited from being designed and manufactured using an ever-increasing budget of transistors with very reliable integrated circuits. However, moving forwards such a “free lunch” is over. As transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening system lifespans.

We are tackling this challenge through both hardware and software schemes. Our work in DSN 2018 proposed a heterogeneous architecture to dramatically reduce the overheads of hardware error detection compared to industry-standard dual-core lockstep. COMET is a technique to improve the performance of multi-threaded reliability code that is inserted automatically by the compiler.

Publications

Automatic Parallelisation

Threads With the advent of multi and many-core processors, there is a growing need for tools that can extract parallelism from general sequential codes. Automatic parallelisation is a seductive route to achieving this, since it removes the burden of identifying and exploiting parallelism from the programmer. However, automatically parallelising ordinary, irregular, single-threaded codes and obtaining significant speedups has historically been an extremely difficult challenge.

My most notable research in this direction has been on the HELIX parallelising compiler. HELIX is a fully automatic, non-speculative loop parallelisation technique for generic sequential code. It identifies the most profitable loops to parallelise, taking account of the cross-iteration data dependences that they contain, as well as the amount of work that they perform. We developed the HELIX compiler and architectural support for fast communication between cores (called a ring cache), which allows the whole system to achieve unrivalled speedups for non-numerical (i.e. more irregular) applications.

Our work on HELIX has largely ceased due to the difficulties in getting applications into its source form (HELIX was implemented in a prototype compiler that took in CIL bytecode as source). My research now focuses on techniques to extract loop-level parallelism from application binaries, and to develop compiler schemes that go beyond HELIX by understanding the differences between how compilers and programmers parallelise code.

Publications

Vectorisation

Running track Single instruction multiple data (SIMD) instruction sets provide energy-efficient performance by taking advantage of data-level parallelism whereby each instruction is applied to a vector of data. In the past, my group has looked at schemes to increase the performance of SLP vectorisation, which is an algorithm to vectorise straight-line code (in contrast to vectorising loops). We achieved this by both adding extra, redundant instructions and by reducing the size of the graphs that the algorithm builds when identifying opportunities for vectorisation. We are now considering techniques to increase the coverage of vectorisation in general, through additional hardware and compiler support for it.

Publications

Compiler and Microarchitectural Design

Schematic drawing Designing a new microarchitecture, or a compiler for it, is a time-consuming task. The best architectural configuration and compiler optimisations vary depending on the applications run. We investigated the use of machine learning to aid this design space exploration, created a portable optimising compiler and then developed an online scheme to dynamically adjust the microarchitecture to different phases of an application.

Publications

Caching

Coins Our prior work on caching has focused on improving the energy efficiency of the different caches in the system. This has mostly been in hardware, with some compiler work too. We've proposed different ways of mapping data into the cache so as to reduce tag energy during lookup, and a link-time scheme to remove tag checks entirely for some cache accesses.

Publications

Photonic Networks

Lights Optical networks-on-chip are a promising future technology to increase the performance of data movement. However, photonic interconnects do not currently allow buffering in the optical domain, so circuits must be set up before use. My work with Philip Watts and Anouk van Laer at UCL tried to address this by predicting when communication was going to occur and setting up the circuits in advance, reducing the latency of messages.

Publications

Compiler-Directed Energy Efficiency

Solar panels My PhD developed schemes for power and energy saving within an out-of-order superscalar processor using compiler-generated hints. I looked at two power-hungry structures, namely the issue queue and register file, proposing methods to identify when parts of them wouldn't be used so that they could be turned off, realising energy savings.

Publications

Misc

Abstract paint This is where the random stuff I've published ends up - perhaps it's just one-off papers in a particular subject, perhaps it's papers from work in progress that will eventually become its own category - we'll see!

Publications

Images from Pexels.