Research Overview
My research focuses on extracting multiple forms of parallelism from applications through code transformations and architectural enhancements. Here I give an overview of my current and previous work. See my group page for details on my research group and my publications page for conference, journal and workshop papers.
Arm recently published an article about our collaboration and, coincidentally, the university also published a piece about the wider collaboration with Arm, of which I am a part.
Security
The security of computer systems, and of the data they process, had been slowly growing in importance.
Then, in January 2018, the disclosure of the Spectre and Meltdown vulnerabilities rocketed security to a first-order design constraint.
We have tackled a variety of security-related issues, from low-cost Spectre mitigations to avoiding the use-after-free vulnerabilities that arise in C (and other manually memory-managed languages) when programmers cannot reason carefully enough about when memory should be freed, often because the program is too complicated to allow it. MuonTrap is a scheme to isolate speculative state within a small filter cache between the processor core and the first level of the cache hierarchy, without introducing additional side channels that could leak information. The Guardian Council is a programmable solution for security monitoring, allowing operating-system-verified kernels to constantly check the execution of code running on a high-performance core. MarkUs is a drop-in replacement for C's memory allocator: when memory is freed, it holds objects in a quarantine until we can be sure there are no dangling pointers to them. Partnering with the team that developed CHERI, we have built on MarkUs to develop CHERIvoke and Cornucopia, both of which leverage CHERI capabilities to provide fast pointer revocation.
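To make the quarantine idea concrete, here is a minimal sketch in C of a quarantining free. It is illustrative only, not MarkUs itself: the names (`quarantine_free`, `quarantine_flush`, `QUARANTINE_SLOTS`) are invented for this example, and the real system decides when reclamation is safe by sweeping memory for dangling pointers rather than using a fixed-size buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical quarantine: freed blocks are parked here instead of
 * being returned to the allocator immediately. */
#define QUARANTINE_SLOTS 128
static void *quarantine[QUARANTINE_SLOTS];
static size_t q_count = 0;

/* Drop-in replacement for free(): park the block in the quarantine
 * so it cannot be reallocated while dangling pointers may exist. */
void quarantine_free(void *p) {
    if (q_count < QUARANTINE_SLOTS)
        quarantine[q_count++] = p;
    else
        free(p); /* the real scheme would first check for dangling pointers */
}

/* Called once a (conceptual) sweep has shown no dangling pointers into
 * the quarantined blocks: only then is the memory actually reclaimed. */
void quarantine_flush(void) {
    for (size_t i = 0; i < q_count; i++)
        free(quarantine[i]);
    q_count = 0;
}
```

The point of the indirection is that a use-after-free on a quarantined block touches memory that has not yet been reused, turning a silent corruption into a benign (or detectable) access.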
Publications
- MuonTrap Spectre mitigation (ISCA 2020)
- MarkUs use-after-free prevention (Oakland 2020)
- Cornucopia capability revocation (Oakland 2020)
- The Guardian Council (ASPLOS 2020)
- CHERIvoke capability revocation (MICRO 2019)
Prefetching (Memory-Level Parallelism)
Many modern workloads for high-performance computing (HPC) and data processing are heavily memory-latency bound, due to the von Neumann bottleneck in conventional multiprocessors.
The traditional solution to overcome this has been prefetching: using hardware to detect common access patterns such as strides so as to bring the required data into fast cache memory before it is requested by the processor.
However, existing techniques do not work well for irregular access patterns, as seen in linked data structures and in indirect memory accesses, where the addresses loaded are based on indices stored in arrays.
My group has developed both hardware and software schemes to aid in prefetching these indirect memory access patterns. Our hardware prefetcher for breadth-first searches on graphs sits alongside the L1 cache and uses data structure knowledge to prefetch accurately far ahead in the computation. Breadth-first search is a highly common compute kernel used in social media analysis, web crawling, model checking and many other fields, as it forms the basis of a large number of algorithms. A more general software scheme targets any access pattern using an array for indirection, and is available as an open-source pass for LLVM. Most recently we've looked again at runahead execution (continuing speculative execution once the reorder buffer fills) and have developed a scheme to vectorise these future instructions, exploiting the processor's data-level parallelism to gain memory-level parallelism in prefetching.
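The indirect pattern the software scheme targets can be sketched in a few lines of C. This is a hand-written illustration, not output of our LLVM pass; the function name, array names and the prefetch distance `DIST` are all invented for the example, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* data[index[i]] defeats a stride prefetcher, because consecutive
 * addresses are unrelated. But the address needed DIST iterations
 * from now is computable today, so we can prefetch it in software. */
#define DIST 16

long sum_indirect(const long *data, const int *index, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&data[index[i + DIST]], /*rw=*/0, /*locality=*/3);
        sum += data[index[i]];
    }
    return sum;
}
```

Choosing `DIST` is the interesting part: too small and the prefetch arrives late, too large and the line is evicted before use.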
Publications
- Vector runahead (ISCA 2021)
- Prefetching in functional languages (ISMM 2020)
- Prefetching unstructured mesh applications (IA3 2018), extended version in TOPC 2020
- Programmable prefetchers (ASPLOS 2018)
- Software prefetching (CGO 2017), extended version in TOCS 2019
- Hardware breadth-first graph prefetcher (ICS 2016)
Reliability
Modern computer systems have benefited from ever-increasing transistor budgets and highly reliable integrated circuits.
However, moving forwards, such a “free lunch” is over.
As transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening system lifespans.
We are tackling this challenge through both hardware and software schemes. Our work in DSN 2018 proposed a heterogeneous architecture to dramatically reduce the overheads of hardware error detection compared to industry-standard dual-core lockstep. COMET is a technique to improve the performance of multi-threaded reliability code that is inserted automatically by the compiler.
Publications
- ParaDox performance by increasing errors (HPCA 2021)
- ParaMedic error correction (DSN 2019)
- Heterogeneous error detection (DSN 2018)
- Predictive instruction re-execution (DFT 2017)
- COMET (CASES 2016)
- Lynx queue (ICS 2016)
- Cache wearout reduction (CAL 2016), extended version in TVLSI 2017
- REPAIR (DFT 2015)
- Code impact on voltage noise (SELSE 2013)
- Optimisation impact on AVF (IMPACT 2008)
Automatic Parallelisation
With the advent of multi- and many-core processors, there is a growing need for tools that can extract parallelism from general sequential code.
Automatic parallelisation is a seductive route to achieving this, since it removes the burden of identifying and exploiting parallelism from the programmer.
However, automatically parallelising ordinary, irregular, single-threaded codes and obtaining significant speedups has historically been an extremely difficult challenge.
My most notable research in this direction has been on the HELIX parallelising compiler. HELIX is a fully automatic, non-speculative loop parallelisation technique for generic sequential code. It identifies the most profitable loops to parallelise, taking account of the cross-iteration data dependences that they contain, as well as the amount of work that they perform. We developed the HELIX compiler and architectural support for fast communication between cores (called a ring cache), which allows the whole system to achieve unrivalled speedups for non-numerical (i.e. more irregular) applications.
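The cross-iteration dependences HELIX must respect are easiest to see in code. The sketch below is my own illustration, not HELIX's internal representation: the function names are invented, and `f()` stands in for arbitrary per-iteration work.

```c
#include <assert.h>

/* Stand-in for independent per-iteration work. */
static long f(long i) { return i * i; }

/* A loop of the shape HELIX parallelises, distributing iterations
 * round-robin across cores. The update of `acc` is a cross-iteration
 * (loop-carried) dependence: it must execute in iteration order, so
 * it becomes a sequential segment whose value the ring cache forwards
 * quickly from core to core. The call to f(i) carries no dependence
 * and runs fully in parallel. */
long helix_style_loop(long n) {
    long acc = 0;                 /* loop-carried state */
    for (long i = 0; i < n; i++) {
        long t = f(i);            /* parallel segment */
        acc += t;                 /* sequential segment */
    }
    return acc;
}
```

Profitability then comes down to the ratio of parallel to sequential work per iteration, and to how cheaply the loop-carried value can be communicated, which is what the ring cache accelerates.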
Our work on HELIX has largely ceased due to the difficulties in getting applications into its source form (HELIX was implemented in a prototype compiler that took in CIL bytecode as source). My research now focuses on techniques to extract loop-level parallelism from application binaries, and to develop compiler schemes that go beyond HELIX by understanding the differences between how compilers and programmers parallelise code.
Publications
- Janus parallelism transformations (VEE 2019)
- Janus automatic parallelisation (CGO 2019)
- Analysing loop-carried dependences (CC 2016)
- Ring Cache (ISCA 2014), research highlight in CACM 2017
- HELIX overview (DAC 2012)
- HELIX (CGO 2012) and an overview in IEEE Micro 2012
Vectorisation
Single instruction multiple data (SIMD) instruction sets provide energy-efficient performance by taking advantage of data-level parallelism whereby each instruction is applied to a vector of data.
In the past, my group has looked at schemes to increase the performance of SLP vectorisation, which is an algorithm to vectorise straight-line code (in contrast to vectorising loops).
We achieved this both by adding extra, redundant instructions and by reducing the size of the graphs the algorithm builds when identifying opportunities for vectorisation.
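For readers unfamiliar with SLP, the kind of code it targets looks like the following. This is a generic textbook-style example (the function and array names are mine): four isomorphic scalar adds on adjacent memory that a compiler's SLP pass can pack into a single 4-wide vector add, with no loop in sight.

```c
#include <assert.h>

/* Straight-line, loop-free code: four structurally identical adds on
 * consecutive elements. SLP vectorisation groups these scalar
 * statements into one vector operation. */
void add4(float *restrict c, const float *a, const float *b) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}
```

When one of the four statements differs slightly, the pack breaks; adding a redundant instruction to restore the symmetry, as in our work, can make the whole group vectorisable again.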
We are now considering techniques to increase the coverage of vectorisation in general, through additional hardware and compiler support for it.
Publications
Compiler and Microarchitectural Design
Designing a new microarchitecture, or a compiler for it, is a time-consuming task.
The best architectural configuration and compiler optimisations vary depending on the applications run.
We investigated the use of machine learning to aid this design space exploration, created a portable optimising compiler and then developed an online scheme to dynamically adjust the microarchitecture to different phases of an application.
Publications
- Dynamic microarchitectural adaptation (MICRO 2010), extended version in TACO 2013
- Portable optimisation (MICRO 2009)
- Early-stage microarchitectural design (ICCD 2009)
- Predicting the architecture / optimisation space (CASES 2008), extended version in TECS 2012
- Microarchitectural design space exploration (MICRO 2007), extended version in TC 2011
Caching
Our prior work on caching has focused on improving the energy efficiency of the different caches in the system.
This has mostly been in hardware, with some compiler work too.
We've proposed different ways of mapping data into the cache so as to reduce tag energy during lookup, and a link-time scheme to remove tag checks entirely for some cache accesses.
Publications
- HALO post-link heap-layout optimisation (CGO 2020)
- Optimising sparse directories (DATE 2014)
- Region-aware cache partitioning (ICCD 2013)
- Co-operative partitioning (HPCA 2012)
- The migration prefetcher (HiPEAC and TACO 2012)
- Smart cache (SAMOS 2011), extended version in IJPP 2013
- Tagless instruction cache (CGO 2011)
- Compiler way-placement (DATE 2008)
Photonic Networks
Optical networks-on-chip are a promising future technology to increase the performance of data movement.
However, photonic interconnects do not currently allow buffering in the optical domain, so circuits must be set up before use.
My work with Philip Watts and Anouk van Laer at UCL tried to address this by predicting when communication was going to occur and setting up the circuits in advance, reducing the latency of messages.
Publications
- Multicast invalidates (AISTECS 2016)
- Coherence-based message prediction (DATE 2015)
- Towards zero-latency switching (SiPhotonics 2014)
- Full system simulation (OFC 2013)
Compiler-Directed Energy Efficiency
My PhD developed schemes for power and energy saving within an out-of-order superscalar processor using compiler-generated hints.
I looked at two power-hungry structures, namely the issue queue and register file, proposing methods to identify when parts of them wouldn't be used so that they could be turned off, realising energy savings.
Publications
- Register caching (TACO 2009)
- Designing efficient processors (INTERACT 2007)
- Early register release (PACT 2005), extended version in TACO 2009
- Issue queue energy reduction (HPCA 2005), extended version in HiPEAC Journal 2011
Misc
This is where the random stuff I've published ends up: perhaps one-off papers on a particular subject, perhaps papers from work in progress that will eventually become its own category. We'll see!
Publications
Images from Pexels.