Computer Laboratory

Computer Architecture Group

ACS & Part II/III Project Suggestions

The following project suggestions are starting points for possible Part II, Part III and ACS projects. ACS and Part III projects require more of a research emphasis to be successful, whereas Part II projects might focus more on computer engineering and reproducibility of results, though top marks tend to be awarded to more research-oriented projects.

Please contact the proposer(s) by email if you are interested in any of the projects below. In addition, some of the projects from previous years may still be suitable and interesting. Please remember, these are just starting points that suggest possible directions for the research. Please check back over the coming weeks for more projects. We would also be happy to consider any project ideas of your own.

Timothy Jones maintains his own list of project suggestions here.

Suggestions for 2023/2024


  1. Tiny RV64I
    Contact: Prof Simon Moore


    An alternative to Jon Woodruff's proposed "serial" RISC-V from 2022/2023: design a physically small 64-bit processor for I/O applications on SoCs. Motivation: I/O devices on many SoCs use tiny 32b microcontrollers, which adds flexibility and allows in-the-field software/firmware upgrades, thereby mitigating risks by allowing bug fixes. However, some of these I/O devices need to be bus masters, and on a 64b SoC with 48b or 56b physical addresses the I/O processors cannot directly access all of the memory. One solution would be to use tiny 64b processors. Given that physical size rather than performance matters for many of these tiny cores, it could be useful to have a 64b processor with, say, an 8b data-path where one 64b operation takes eight clock cycles to complete. This project would explore implementation of RV64I (the RISC-V 64b integer base instructions) with options to expand to other ISA extensions and possibly our CHERI security extension. This could be written in SystemVerilog, though a Bluespec SystemVerilog implementation would make it easier to parameterise the design on data-path width.
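
    As a rough illustration of the idea (not part of the proposal), the following Python sketch models a 64-bit addition performed on an 8-bit data-path over eight clock cycles, with the carry held in a flip-flop between cycles; a real implementation in (Bluespec) SystemVerilog would follow the same structure.

        # Behavioural model of a 64-bit add performed on an 8-bit data-path.
        # Each "cycle" processes one 8-bit slice and carries into the next,
        # so one 64-bit operation takes 64/8 = 8 cycles.
        QUANTUM = 8                     # data-path width in bits
        WIDTH = 64                      # architectural register width
        MASK = (1 << QUANTUM) - 1

        def serial_add(a: int, b: int) -> int:
            result, carry = 0, 0
            for cycle in range(WIDTH // QUANTUM):
                slice_a = (a >> (cycle * QUANTUM)) & MASK
                slice_b = (b >> (cycle * QUANTUM)) & MASK
                s = slice_a + slice_b + carry
                carry = s >> QUANTUM                      # carry held between cycles
                result |= (s & MASK) << (cycle * QUANTUM)
            return result & ((1 << WIDTH) - 1)

        assert serial_add(0xFFFF_FFFF_FFFF_FFFF, 1) == 0          # wraps like RV64I ADD
        assert serial_add(123456789, 987654321) == 123456789 + 987654321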



  2. Multiprogramming on an open-source RISC-V GPU
    Contact: Dr Theo Markettos and Dr Paul Metzger.


    SIMTight is an open-source Single Instruction, Multiple Thread (SIMT) GPGPU-like core that runs on an FPGA. It contains a core able to run 2048 threads at the same time in a vector manner, and a scalar CPU that can be used for launching workloads, management and other housekeeping. Both run the 32-bit RISC-V instruction set with CHERI extensions. At present SIMTight is only able to run a single workload at once.

    On GPUs today, units of GPU compute (called 'kernels') are submitted from client applications and scheduled one-after-another, waiting for one to complete before scheduling the next. This can be problematic if they run for a long time, as other kernels will be blocked until the first finishes.

    This project would consider ways to multiprogram SIMTight, so that multiple kernels can be run. For example, a pre-emptive GPU-side scheduler could transfer control from one kernel to another, based on a timer interrupt, saving and restoring GPU state to switch context in a safe manner. A scheduler would then allocate time to each kernel according to a scheduling policy. Additionally such a scheduler could accept work from client applications.
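
    As a toy illustration of the pre-emptive option, the Python sketch below round-robins between kernel contexts on a timer quantum; the KernelContext fields, the time slice and the commented save/restore hooks are hypothetical stand-ins for whatever state SIMTight would actually need to spill and reload.

        from collections import deque
        from dataclasses import dataclass, field

        # Hypothetical saved state for one GPU kernel: in real hardware this would be
        # the SIMT program counters, register files and any in-flight thread state.
        @dataclass
        class KernelContext:
            name: str
            remaining_cycles: int                 # stand-in for "work left to do"
            saved_state: dict = field(default_factory=dict)

        TIME_SLICE = 1000                         # cycles between timer interrupts (assumed)

        def run_round_robin(kernels):
            """Pre-emptive round-robin: run each kernel for one time slice, then
            save its state and switch to the next runnable kernel."""
            ready = deque(kernels)
            while ready:
                k = ready.popleft()
                # restore_state(k.saved_state)    # would reload GPU state here
                ran = min(TIME_SLICE, k.remaining_cycles)
                k.remaining_cycles -= ran         # simulate doing `ran` cycles of work
                print(f"ran {k.name} for {ran} cycles, {k.remaining_cycles} left")
                if k.remaining_cycles > 0:
                    # save_state(k.saved_state)   # would spill GPU state here
                    ready.append(k)               # timer interrupt: pre-empt and requeue

        run_round_robin([KernelContext("matmul", 2500), KernelContext("fft", 800)])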

    Different types of multiprogramming may be explored - for example, one-after-another scheduling, cooperative multitasking where kernels yield control, or fragmenting kernels to reduce response time.

    At present, SIMTight does not have a strong application model on the CPU side - different applications must be linked together into a single binary. Another aspect would be to build means to launch multiple applications - for example, loading them from storage and launching them. With appropriate context switching between CPU applications this would allow codesign of CPU-side and GPU-side scheduling.

    In advance of starting the project it would be advantageous to have done some background reading on the following:

    • GPGPU programming with CUDA or OpenCL
    • programming in assembler
    • interrupt handlers and context switches in operating systems

    This project is ideal for someone with an interest in low-level software, operating systems, or parallel programming.

Suggestions for 2022/2023

  1. Many of these suggestions are still valid for 2023/2024

  2. Vector Runahead
    Contact: Timothy Jones


    To address the widening performance gap between CPU cores and main memories, designers have implemented prefetchers into various levels of the cache hierarchy, so as to bring data close to the processor before it is needed, meaning it is available in fast storage at the point of use. There are a wide variety of data prefetchers available, but few that can accurately identify data that is accessed through complex data structures.

    An alternative scheme is Runahead execution. Here, when the processor is stalled, it continues speculatively fetching and executing instructions from the future so as to perform their memory accesses, then discards them and re-executes them correctly once the pipeline starts up again. This provides an accurate form of prefetching within the core, rather than as separate logic beside the cache. Until recently though, Runahead techniques couldn't deal with complex access patterns either. However, a new scheme, called Vector Runahead, can effectively prefetch these access patterns, providing significant performance increases for certain workloads. The aim of this project is to implement Vector Runahead in the gem5 simulator to reproduce the impressive results obtained, with more advanced extensions possible too.
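
    To illustrate the intuition only (this is a simplification, not the published microarchitecture), consider an indirect pattern such as sum += A[B[i]]: when the load of A[B[i]] misses, a Vector Runahead-style scheme speculatively gathers the indices of the next several iterations and issues their addresses as one vectorised batch of prefetches. A hypothetical Python sketch:

        # Toy model of the Vector Runahead intuition for: for i: use(A[B[i]])
        # On a miss at iteration i, speculatively compute addresses A + B[i+1..i+W]
        # as one "vector" of prefetches rather than running ahead one access at a time.
        RUNAHEAD_WIDTH = 8          # how many future iterations to vectorise (assumed)

        def vector_runahead_prefetches(B, A_base, elem_size, i):
            """Return the addresses that would be prefetched when iteration i stalls."""
            future_indices = B[i + 1 : i + 1 + RUNAHEAD_WIDTH]    # gather the index vector
            return [A_base + idx * elem_size for idx in future_indices]

        B = [3, 17, 5, 42, 8, 23, 11, 0, 9, 31]
        print(vector_runahead_prefetches(B, A_base=0x1000_0000, elem_size=8, i=2))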



  3. "Serial" RISC-V Processor in Bluspec SystemVerilog
    Contact: Jonathan Woodruff


    SERV is a well-known open-source serial RISC-V novelty implementation. That is, all arithmetic is performed bit-serial, ensuring that the data-path is extremely small. This is not only a fun idea; there is hope that 8-bit microcontrollers for implementing state machines might be replaced with a full RISC-V core, benefitting from a modern software toolchain.

    That being said, a pure bit-serial datapath is likely unnecessarily small and slow, since the register file is likely to use a significant amount of SRAM, not to mention the instruction or data memory. This project would reimplement a SERV-like pipeline in parameterisable Bluespec SystemVerilog that is able to operate on arbitrary quanta, e.g. 4 or 8 bits at a time. This would allow an evaluation of area versus performance, and invite comparisons with the original SERV core in Verilog. The base project would implement RV32I. There may be an option of adding the "m" extension for multiply and divide, "b" for bit manipulation, or "s" for supervisor mode.



  4. CHERI OpenSHMEM for FreeBSD
    Contact: Jonathan Woodruff


    CHERI enables memory safety within a virtual address space, and OpenSHMEM provides a library framework for computing using distributed address spaces. This project would port the OpenSHMEM library to run in a FreeBSD (or rather, CheriBSD) user-space application in pure-capability mode, using capabilities to refer to all allocations. It would port two or three OpenSHMEM applications to run under this library and evaluate their performance characteristics. It would then reason about the security properties of CHERI OpenSHMEM in a truly distributed system where memory references cannot be fabricated. An extension might extend the library to actually share address space with a user process on another FreeBSD instance without coherent memory, necessitating network communication in some OpenSHMEM primitives and noting the repercussions for security. Another extension might move from the QEMU CHERI-RISC-V emulator to CHERI-Toooba hardware to measure performance.



  5. Crazy Fast RISC-V for Stratix 10 FPGAs
    Contact: Simon Moore


    Stratix 10 FPGAs can support much higher clock frequencies than older FPGA architectures, provided designs are sufficiently pipelined. The project idea is to design a number of simple RISC-V cores (probably just integer, i.e. "I" class) that aim to push the clock frequency as high as possible, even if it ends up with a crazily deep pipeline that may not perform well (i.e. trading higher clock frequency for lower IPC). Intel's RISC-V softcore (NIOS-V) clocks at around 360MHz, but the hypothesis for this project is that a RISC-V core can be clocked at over 600MHz.



  6. Hardware Stream Compression
    Contact: Jonathan Woodruff


    A FIFO stream of tokens is a convenient and efficient structure in hardware, and is therefore extremely common. Sometimes it is convenient to have a logical stream that is quite verbose and sparse, containing fields that are not populated for all cases. One might then attempt a custom "packed" format that eliminates redundant fields or shares bits where two fields are not used at the same time. It would be far more convenient to use a general set of modules that could compress and decompress an arbitrary data stream.

    This project would develop a pair of modules in Bluespec SystemVerilog that expose a streaming interface for an arbitrary data type but pass data between themselves in a compressed byte-stream (or similar format). The byte-stream would use only the number of bytes necessary to encode each token. Several schemes might be explored for hardware compression and decompression, trading off complexity for efficiency. Strategies might use static or dynamic dictionaries, for example. Example use cases might include RVFI, a verbose CPU tracing format, or wide AXI requests. An extension might evaluate the compression scheme on a multi-core FPGA system-on-chip network.
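
    As one very simple encoding to start from, the Python sketch below packs a struct-like token into a byte stream using a presence bitmap, so unused optional fields cost nothing; the field names and widths are made up for illustration and are not a real RVFI or AXI layout.

        # Toy "compressed token" encoding: a one-byte presence bitmap followed only by
        # the fields that are actually populated, each as a fixed-width little-endian value.
        FIELDS = [("pc", 8), ("rd_wdata", 8), ("mem_addr", 8), ("trap_cause", 1)]

        def pack(token: dict) -> bytes:
            bitmap, payload = 0, b""
            for bit, (name, width) in enumerate(FIELDS):
                if token.get(name) is not None:
                    bitmap |= 1 << bit
                    payload += token[name].to_bytes(width, "little")
            return bytes([bitmap]) + payload

        def unpack(data: bytes) -> dict:
            bitmap, pos, token = data[0], 1, {}
            for bit, (name, width) in enumerate(FIELDS):
                if bitmap & (1 << bit):
                    token[name] = int.from_bytes(data[pos:pos + width], "little")
                    pos += width
            return token

        t = {"pc": 0x8000_0000, "rd_wdata": None, "mem_addr": None, "trap_cause": 2}
        assert unpack(pack(t)) == {"pc": 0x8000_0000, "trap_cause": 2}
        print(len(pack(t)), "bytes instead of", sum(w for _, w in FIELDS) + 1)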



  7. Cache Prefetcher
    Contact: Jonathan Woodruff


    RiscyOO is an open-source, out-of-order, superscalar RISC-V processor core. The RiscyOO memory subsystem is capable of an impressive level of parallelism, supporting many outstanding memory requests. Sadly, RiscyOO so far lacks a cache prefetcher. This project would develop a simple interface for a generic cache prefetcher for RiscyOO and would integrate this module into the RiscyOO design in at least one place, and possibly in both the L1 and L2 caches. The project would then implement several instances of this module of varying complexity: for example, a next-line prefetcher (particularly useful for the instruction cache) and a stride prefetcher. An extension might evaluate performance on FPGA under a full operating system running SPEC benchmarks, which is routinely done for other variants of this core.
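
    By way of illustration, the logic of a classic per-PC stride prefetcher fits in a few lines; the Python sketch below is only a behavioural outline of what would be written in Bluespec behind the proposed prefetcher interface (the prefetch degree and table organisation are arbitrary choices).

        # Behavioural sketch of a per-PC stride prefetcher: remember the last address
        # and stride seen for each load PC, and when the same stride repeats, prefetch
        # a few addresses ahead.
        PREFETCH_DEGREE = 2     # how many addresses to prefetch ahead (assumed)

        class StridePrefetcher:
            def __init__(self):
                self.table = {}                      # pc -> (last_addr, last_stride)

            def observe(self, pc: int, addr: int):
                """Called on every demand access; returns a list of prefetch addresses."""
                prefetches = []
                if pc in self.table:
                    last_addr, last_stride = self.table[pc]
                    stride = addr - last_addr
                    if stride != 0 and stride == last_stride:    # stride confirmed twice
                        prefetches = [addr + stride * k for k in range(1, PREFETCH_DEGREE + 1)]
                    self.table[pc] = (addr, stride)
                else:
                    self.table[pc] = (addr, 0)
                return prefetches

        p = StridePrefetcher()
        for a in [0x1000, 0x1040, 0x1080, 0x10C0]:       # 64-byte stride stream
            print(hex(a), [hex(x) for x in p.observe(pc=0x400, addr=a)])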



  8. Next Generation CHERI Tag Controller
    Contact: Jonathan Woodruff


    CHERI is an instruction set extension that requires single-bit tags for each word in memory. This tagged memory is often emulated using standard DRAM holding both data and a table of tags, with a tag controller to assemble tagged words for use by the cache hierarchy. In addition to a flat table design, we built a tag controller with a hierarchical table, and ARM's Morello implemented a tag controller with a compressed tag cache. Since these designs were made, we have become able to compile large-scale programs to use CHERI instruction sets and to trace the execution of these programs. In addition, we have developed a set of tools based on the DynamoRIO cache simulator to consume traces of CHERI programs and simulate various cache hierarchies.
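
    For orientation, the sketch below models the two existing table organisations in Python (granule and group sizes are illustrative, not the real layouts): a flat table holds one tag bit per capability-sized granule of DRAM, while a hierarchical table adds a root level whose bits record whether any tag in a whole leaf group is set, so reads in untagged regions need only the root access.

        # Illustrative model of flat vs hierarchical tag tables (not the real layouts).
        CAP_SIZE = 16                  # bytes per capability granule (CHERI-128 style)
        GROUP = 512                    # leaf bits summarised by one root bit (assumed)

        class FlatTagTable:
            def __init__(self, mem_bytes):
                self.bits = [0] * (mem_bytes // CAP_SIZE)
            def accesses_for_read(self, addr):
                return 1               # always one table access per line fill

        class HierTagTable:
            def __init__(self, mem_bytes):
                self.leaf = [0] * (mem_bytes // CAP_SIZE)
                self.root = [0] * (len(self.leaf) // GROUP)
            def set_tag(self, addr):
                g = addr // CAP_SIZE
                self.leaf[g] = 1
                self.root[g // GROUP] = 1
            def accesses_for_read(self, addr):
                g = addr // CAP_SIZE
                # If the root bit is clear the whole group is untagged: no leaf access.
                return 1 if self.root[g // GROUP] == 0 else 2

        flat, hier = FlatTagTable(1 << 20), HierTagTable(1 << 20)
        hier.set_tag(0x4000)
        print(flat.accesses_for_read(0x0), hier.accesses_for_read(0x0), hier.accesses_for_read(0x4008))
        # -> 1 1 2 : the hierarchy wins when data is largely untagged (caching root bits helps further)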

    This project would extend the DynamoRIO cache framework with models of the existing tag cache mechanisms and characterise their efficiency on large-scale CHERI workloads. The project would then develop a next-generation tag controller that aims to reduce tag fetch delays by an order of magnitude through some mixture of prefetching and more advanced compression. This work is relevant to the state of the art and is likely to result in a publication as well as industry engagement.



  9. Using RISC-V vector instructions for Software Defined Radio
    Contact: Michael Roe


    Overview
    The RISC-V instruction set architecture contains an optional set of single instruction, multiple data (SIMD) vector instructions. Software defined radio applications (such as gnuradio) are typically mostly written in high-level languages such as C++, with only the performance-critical inner loops written using assembly language vector instructions. The goal of this project is to write a RISC-V equivalent of some of the parts of gnuradio that currently use x86 vector instructions.
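
    For context, the inner loops in question are kernels such as FIR filters and dot products over sample streams; the NumPy sketch below is just a reference implementation of that kind of kernel (not gnuradio code), of the sort that would be hand-written with RISC-V vector instructions.

        import numpy as np

        # Reference implementation of a complex FIR filter loop -- the sort of inner
        # kernel (cf. gnuradio/VOLK dot-product kernels) that this project would
        # rewrite using the RISC-V "V" vector instructions.
        def fir_filter(samples: np.ndarray, taps: np.ndarray) -> np.ndarray:
            n_out = len(samples) - len(taps) + 1
            out = np.empty(n_out, dtype=np.complex64)
            for i in range(n_out):
                # This multiply-accumulate over the taps is the performance-critical part.
                out[i] = np.dot(samples[i:i + len(taps)], taps)
            return out

        samples = (np.random.randn(4096) + 1j * np.random.randn(4096)).astype(np.complex64)
        taps = np.ones(64, dtype=np.complex64) / 64        # simple moving-average taps
        print(fir_filter(samples, taps)[:4])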

    This project assumes familiarity with

    • The principles explained in the Digital Signal Processing lecture course
    • Assembly language programming

    Depending on how the project goes, it might also involve:

    • Fixes to the Linux operating system kernel, either in context switching or CPU feature detection
    • Fixes to the LLVM compiler

    You will need:

    • RISC-V CPU hardware or an emulator that supports the vector instructions. Most currently available RISC-V hardware does not support the vector instructions; the Allwinner D1 chip supports an obsolete version of the vector instructions, because the specification wasn't finalized before the chip was designed. However, emulators such as QEMU or spike (probably) support the vector instructions, and could be used in this project.
    • An assembler and/or C compiler that supports the vector instructions. I believe that LLVM supports the vector instructions; some bug-fixing of the compiler might turn out to be needed.
    • A version of the operating system (e.g. Linux) that saves and restores the contents of the vector registers on context switch. Some investigation as to the current implementation status may be needed.
    • Ideally, the operating system should communicate to the user-space program whether or not the CPU supports the vector instructions (on Linux, either via /proc/cpuinfo or hwcap.h). The Linux kernel portion of this is believed to be mostly working, at least in the case of the basic ("V") vector extension. Some fixing of the operating system might be needed here.

    Short cuts
    It's good for project plans to have a way of getting something working quickly, if the long-term goal turns out to be harder or more time consuming than anticipated.

    • It's not necessary to write RISC-V versions of all of the gnuradio primitives that are written in x86 assembler; just doing a few of the most performance critical ones would get most of the benefit.
    • It ought to be possible to run the assembly-language inner loops in isolation, without the rest of the software defined radio application (i.e. just run a test program that calls that subroutine, without the rest of the SDR app). If operating system issues are getting in the way, it ought to be possible to run this "bare metal", i.e. directly on the CPU without an operating system, or with a very minimal operating system such as FreeRTOS.


  10. Modelling of concurrency-relevant architectural decisions in Agile
    Contact: Jasmin Jahic


    Finding concurrency bugs is a hard problem. Existing static and dynamic approaches are imprecise when trying to reverse engineer architectural decisions related to shared memory between threads and intended synchronization. This makes it hard for testers to choose adequate testing approaches for concurrent software.

    An alternative to reverse-engineering is documenting these decisions during development. There exist approaches that instruct developers which concurrency-related decisions to document during development. However, the suggested modelling techniques do not scale. The goal of this project is to develop a tool that integrates well with an Agile environment and enables developers to document and update their concurrency-related decisions.



  11. Enhancement of Lockset Eraser algorithm for finding concurrency bugs
    Contact: Jasmin Jahic


    Finding concurrency bugs is a hard task because the problem itself is NP-hard and is conditioned by the non-deterministic behavior of software. One approach to finding concurrency bugs is to execute the software under test, observe its execution trace, and then analyze it. Various algorithms exist for this analysis; one of the best known is the Eraser lockset algorithm. However, this algorithm has several flaws.

    This work will focus on summarizing existing extensions of the Eraser lockset algorithm and creating new extensions, especially to handle intentions related to the initialization of shared variables and to deciding which locks protect which shared variables. These challenges have a negative effect on the precision of the lockset algorithm, and this work aims to remove them. For execution tracing, we will use an existing tracing tool.
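
    For reference, the core refinement step of the Eraser lockset algorithm is tiny: each shared variable's candidate lockset is intersected with the set of locks held at every access, and an empty set signals a potential race. The Python sketch below shows just that step, deliberately omitting Eraser's initialization and read-sharing state machine - exactly the cases the proposed extensions would handle better.

        # Minimal Eraser-style lockset refinement: C(v) is intersected with the locks
        # held at each access to v; if C(v) becomes empty, report a potential data race.
        candidate_locks = {}          # variable -> set of locks that (so far) protect it

        def on_access(thread, var, locks_held):
            if var not in candidate_locks:
                candidate_locks[var] = set(locks_held)     # first access: all held locks are candidates
            else:
                candidate_locks[var] &= set(locks_held)    # refine by intersection
            if not candidate_locks[var]:
                print(f"potential race on {var!r} (access by {thread} with locks {locks_held})")

        # Example trace: x is consistently protected by L1, y has no common lock.
        on_access("T1", "x", {"L1"})
        on_access("T2", "x", {"L1", "L2"})
        on_access("T1", "y", {"L1"})
        on_access("T2", "y", {"L2"})     # -> potential race on 'y'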



  12. Continuous testing of concurrent software
    Contact: Jasmin Jahic


    Continuous software testing means executing tests on every software change. While unit tests are perfect for this type of testing, finding concurrency bugs is much more complex: because of implicit coupling between threads, changes in one thread affect other threads. Therefore, it is necessary to establish a more complex type of testing that includes tracing of execution and analysis of the execution trace.

    This project will focus on building a Docker-based environment containing a tracer (we will use an existing tracing tool) that can be deployed on a server. On every push of a software change to version control (e.g., GitHub), a webhook will contact the server running the Docker container and send the changes. Based on these changes, it will be necessary to run tests and analyze the execution trace. The assumption is that unit tests are already present in the software. The basic, brute-force approach will execute all tests to generate a trace; an optimization will propagate the committed changes and execute only the relevant tests. For the analysis of the execution traces, we will use the basic Eraser lockset algorithm, which finds data races.
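
    As a starting point for the webhook side, the sketch below is a standard-library-only Python endpoint that verifies GitHub's X-Hub-Signature-256 HMAC, extracts the changed files from a push event, and launches a placeholder test-and-trace script; the secret, port and run_traced_tests.sh script are assumptions, and in the proposed setup this would run inside the Docker container.

        import hmac, hashlib, json, subprocess
        from http.server import BaseHTTPRequestHandler, HTTPServer

        SECRET = b"change-me"                    # assumed webhook secret configured on GitHub

        class Hook(BaseHTTPRequestHandler):
            def do_POST(self):
                body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
                # GitHub signs the payload with HMAC-SHA256 in the X-Hub-Signature-256 header.
                sig = self.headers.get("X-Hub-Signature-256", "")
                expected = "sha256=" + hmac.new(SECRET, body, hashlib.sha256).hexdigest()
                if not hmac.compare_digest(sig, expected):
                    self.send_response(403); self.end_headers(); return
                event = json.loads(body)
                changed = [f for c in event.get("commits", []) for f in c.get("modified", [])]
                # Placeholder: run the unit tests under the tracer for the affected files.
                subprocess.Popen(["./run_traced_tests.sh", *changed])   # hypothetical script
                self.send_response(202); self.end_headers()

        if __name__ == "__main__":
            HTTPServer(("", 8080), Hook).serve_forever()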



  13. GitHub Application for Continuous Engineering
    Contact: Jasmin Jahic


    Continuous software engineering and DevOps are transforming the software engineering process in companies. However, from the software architecture point of view, there are no good tools to support this process. Most of the existing tools are monoliths that are not suited to cooperation between team members and frequent updates to models.

    Another issue with the existing tools is that they are far removed from code and code development processes. After a while, a gap between the design and the implementation becomes inevitable. In order to solve this issue, the idea is to build a set of tools that support architecture processes and integrate them with source code repositories. In this concrete case, we will use the GitHub API. For this purpose, it will first be necessary to choose several challenges related to continuous software engineering at the software architecture level, and then implement appropriate support for them.



  14. Improving Performance of the LLVM Interpreter
    Contact: Jasmin Jahic


    The LLVM compiler infrastructure is a suitable platform for research projects. One of its most interesting parts is its interpreter, which enables fine-grained tracing of executed instructions and quick quantification of code coverage. However, executing software with the LLVM interpreter introduces significant overhead. The aim of this project is to identify the sources of overhead in the LLVM interpreter and implement appropriate changes to reduce the overhead to an acceptable level. As part of the project, it will be necessary to implement components for writing the traces to JSON trace files.



Older Suggestions

The following projects may not have been done before, or variants of them could still make good projects.


  1. Microarchitectural design of new CHERI domain crossing instructions
    Contact: Jonathan Woodruff


    This project would implement the two experimental instructions described in the CHERI Architecture document in the section titled "Indirect Sentry Capabilities" (see CHERI Architecture v8; upcoming publication). This section describes two domain crossing instruction variants for CHERI which should present interesting tradeoffs in microarchitectural implementation. The first loads a capability pointer from memory directly into the PC, and the second not only does this but also loads a second, adjacent capability pointer into a register. This project would implement examples of these instructions in the Flute (in-order, single-scalar) and/or Toooba (out-of-order, super-scalar) open-source CHERI-RISCV implementations and explore the complexity and cost of adding these unusual instructions to the microarchitecture. There should be some evaluation of performance using micro-benchmarks to demonstrate the value (or lack thereof) of the more sophisticated implementation options.



  2. CHERI Domain Crossing using Simplified Primitives
    Contact: Jonathan Woodruff


    The primary domain crossing mechanism for compartmentalisation in CHERI has relied on sealing with object types. The shortcoming of this approach is that it requires a type field to be allocated in the capability encoding, which is necessarily limited in size and therefore introduces an awkward need to manage this limited type space. A simpler mechanism has recently been added to the CHERI architecture: Sealed Entry Capabilities (or "Sentries"). These simply unseal and jump to a target capability, granting access to any capabilities embedded with the target instructions. Sentries have the convenient property that their object type space scales with the virtual address space, while bearing the disadvantage of tying together objects and code.

    Libcheri is a library used for compartmentalisation which uses the traditional type-based domain crossing instructions. This project would develop an alternative implementation of libcheri based solely on the Sentry mechanism, likely requiring some system of indirection to ensure the integrity of entry tokens distributed to mutually distrusting domains. This project would seek to approach the performance of type-based libcheri for the existing benchmarks to learn where the type-based mechanism might be replaced for this style of compartmentalisation.



  3. More CHERI secure hardware and software projects


    Follow this link to more CHERI hardware and software project suggestions



  4. Compiler to WebAssembly
    Contact: Timothy Jones


    Wouldn't it be great to be able to write web applications in your favourite language? The idea of this project is to take some source language (for example, C or ML, but it could be anything) and write a compiler that transforms it to WebAssembly. You'll probably want to consider just a subset of the language and there may be optimisations to perform on the way to produce faster / easier to read code as the end result.
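
    To make the flavour of the task concrete, here is a deliberately tiny Python sketch that compiles arithmetic expressions (borrowing Python's own syntax as a stand-in source language) into WebAssembly text format; a real project would pick a richer source language and handle parsing, types and control flow itself.

        import ast

        # Compile a Python arithmetic expression (stand-in source language) into a
        # WebAssembly text-format function over i32 values.
        OPS = {ast.Add: "i32.add", ast.Sub: "i32.sub", ast.Mult: "i32.mul"}

        def emit(node, params, out):
            if isinstance(node, ast.BinOp):
                emit(node.left, params, out)       # post-order traversal matches the
                emit(node.right, params, out)      # WebAssembly stack machine
                out.append(f"    {OPS[type(node.op)]}")
            elif isinstance(node, ast.Name):
                out.append(f"    local.get {params.index(node.id)}")
            elif isinstance(node, ast.Constant):
                out.append(f"    i32.const {node.value}")
            else:
                raise NotImplementedError(ast.dump(node))

        def compile_expr(expr: str, params: list) -> str:
            body = []
            emit(ast.parse(expr, mode="eval").body, params, body)
            sig = " ".join("(param i32)" for _ in params)
            return "\n".join([f'(module\n  (func (export "f") {sig} (result i32)',
                              *body, "  )", ")"])

        print(compile_expr("a * 3 + b - 1", ["a", "b"]))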



  5. Dynamic binary translator
    Contact: Timothy Jones


    Let's celebrate Apple's move from x86 to Arm by creating a dynamic binary translator for them. The idea is that it will read in an x86 binary and then either interpret each instruction, or actually emit Arm instructions for each x86 instruction that it then directly executes. It would probably be best to restrict the project to a subset of x86, or a different source ISA if preferred (e.g. AArch32 to AArch64).



  6. Interpreter or JIT compiler
    Contact: Timothy Jones


    In a similar vein, why not create an optimised interpreter or JIT compiler for your favourite source language? This could be an established language, such as C or OCaml, or a minimalist language, such as P'' (or a descendant), or an existing bytecode (Java being the obvious choice, but it could be the CPython bytecode or something else). Or feel free to suggest your own.
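
    For scale, a minimal stack-machine interpreter is only a screenful of Python; the bytecode below is invented for illustration, and a real project would target an established language or bytecode and then focus on optimisation (better dispatch, inline caching, or JIT compilation).

        # A tiny stack-based bytecode interpreter with an invented instruction set.
        # A JIT project would replace this dispatch loop with generated machine code.
        PUSH, ADD, MUL, DUP, JUMP_IF_ZERO, HALT = range(6)

        def run(code):
            stack, pc = [], 0
            while True:
                op = code[pc]
                if op == PUSH:
                    stack.append(code[pc + 1]); pc += 2
                elif op == ADD:
                    b, a = stack.pop(), stack.pop(); stack.append(a + b); pc += 1
                elif op == MUL:
                    b, a = stack.pop(), stack.pop(); stack.append(a * b); pc += 1
                elif op == DUP:
                    stack.append(stack[-1]); pc += 1
                elif op == JUMP_IF_ZERO:
                    target = code[pc + 1]
                    pc = target if stack.pop() == 0 else pc + 2
                elif op == HALT:
                    return stack.pop()

        # (2 + 3) * 4
        print(run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]))    # -> 20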



  7. Framework for Targeted Tracing of Software Execution with Debug Symbols
    Contact: Jasmin Jahic


    Reasoning about software quality, potential bugs, and dynamic behaviour (e.g., execution hot spots) is not an easy task. Existing approaches for doing so predominantly rely on static analysis. However, static analysis is inherently imprecise. The goal of this project is to design and implement a framework for targeted tracing of software execution that relates executed instructions to debug symbols in order to provide more information for execution trace analysis.

    The tracing will be based on the DynamoRIO binary instrumentation tool. It will be necessary to enable DynamoRIO to extract debug symbols corresponding to executed instructions. Furthermore, it will be necessary to create an external engine for this framework in a language more suitable for complex analysis (e.g., Python or Java). The framework should be configurable with options for targeting specific functions and threads for execution tracing, with the possibility of online configuration (i.e., during execution). The framework should write the execution trace in a configurable form of JSON files.



  8. GitHub Application for Continuous Engineering
    Contact: Jasmin Jahic


    Continuous software engineering and DevOps are transforming the software engineering process in companies. However, from the software architecture point of view, there are no good tools to support this process. Most of the existing tools are monoliths that are not suited to cooperation between team members and frequent updates to models.

    Another issue with the existing tools is that they are far removed from code and code development processes. After a while, a gap between the design and the implementation becomes inevitable. In order to solve this issue, the idea is to build a set of tools that support architecture processes and integrate them with source code repositories. In this concrete case, we will use the GitHub API. For this purpose, it will first be necessary to choose several challenges related to continuous software engineering at the software architecture level, and then implement appropriate support for them.



  9. Improving Performance of LLVM Interpreter
    Contact: Jasmin Jahic


    The LLVM compiler infrastructure is a suitable platform for research projects. One of its most interesting parts is its interpreter, which enables fine-grained tracing of executed instructions and quick quantification of code coverage. However, executing software with the LLVM interpreter introduces significant overhead. The aim of this project is to identify the sources of overhead in the LLVM interpreter and implement appropriate changes to reduce the overhead to an acceptable level. As part of the project, it will be necessary to implement components for writing the traces to JSON trace files.



  10. Support for LLVM interpreter to run multithreaded applications
    Contact: Jasmin Jahic


    The LLVM compiler infrastructure is a suitable platform for research projects. One of its most interesting parts is its interpreter, which enables fine-grained tracing of executed instructions and quick quantification of code coverage. However, the LLVM interpreter does not support execution of multithreaded software. Furthermore, it faces issues when executing functions from dynamically and statically linked libraries. The goal of this project is to implement components within the LLVM interpreter to support the execution and tracing of multithreaded software.



  11. Graphical System-on-Chip Constructor for Bluespec
    Contact: Jonathan Woodruff


    Several commercial vendors have graphical tools for connecting components of a system on chip. These are generally FPGA-specific, or belong to an expensive commercial ASIC package of tools. One challenge is that the SystemVerilog base language is complex to analyse and difficult to parameterise, so automatic generation of wires and converters can be quite involved. Bluespec SystemVerilog is a recently open-sourced hardware description language which is far more parameterisable than SystemVerilog, and the architecture group has recently developed a general-purpose AXI library in the BlueStuff collection of tools.

    This project would develop a graphical user interface that could generate a system-on-chip fabric to connect Bluespec AXI masters and slaves in an arbitrary arrangement, including busses and width adapters, along with address mappings for each master. In addition to generating Bluespec for the interconnect, this project should also produce Device Tree files to describe the system to be consumed by software using the system. Optionally, the tool may allow editing any of the device tree, graphical description, or the Bluespec source and maintain consistent design across all three.



  12. Graphical Analysis of Bluespec SystemVerilog Designs
    Contact: Jonathan Woodruff


    Bluespec SystemVerilog is a recently open-sourced hardware description language that is used extensively in the Cambridge computer architecture research group for design exploration and implementation. Bluespec SystemVerilog imposes a greater degree of structure on designs than SystemVerilog, typically requiring transactional interfaces and grouping actions into atomic rules. This structure enables greater analysis of the semantics of a design and has been used for formal reasoning about Bluespec designs.

    This project would leverage any existing tools for reasoning about Bluespec designs to construct a graphical representation of an architecture described in Bluespec, displaying all modules hierarchically along with connections between them. Optionally state machines or sets of exclusive rules might be rendered in a useful way. Optionally area and timing from a synthesis of a Bluespec design might be added to the graphical display of the design.



  13. ARM/AArch64/RISC-V backend for Duplo
    Contact: Nandor Licker


    Duplo is a post-link optimiser for OCaml and C, currently targeting only amd64. The framework relies on a Low-Level Intermediate Representation (LLIR) to represent and optimise both C and OCaml, lowering LLIR through the code generator of LLVM. Besides expanding the framework, building a backend for a RISC processor would allow for a comparison between the improvement the LLVM backend brings in for a CISC architecture and the potential speedup on a RISC architecture.



  14. Feedback-directed optimisation for Duplo
    Contact: Nandor Licker


    Feedback-directed optimisation relies on dynamic profile information to improve branch and function layout in a binary to better exploit instruction caches. Profile information gathered using perf and Last Branch Records (LBRs) can be used to identify "hot" and "cold" sections in a program, feeding this information to the compiler to improve layout decisions. The project would either reuse existing tools (such as AutoFDO) or manually process perf files to identify branch weights, passing the information to the LLVM target-specific pipeline for block layout. As an extension, the same information could also be used to improve function placement prior to lowering to LLVM.



  15. Comparison of parallel architectures
    Contact: Peter Rugg


    Modern computing hardware provides parallelism in many forms in the hope that applications can exploit it. Parallelising code is not easy, often requiring a complete rethink of the approach taken to achieve orders-of-magnitude performance gains. This project would take some particular problem and investigate how suited alternative parallel architectures are to solving it. After completing an initial C baseline, parallel implementations can be completed targeting multi-core CPUs, CPU vector instructions, GPUs, and even a custom FPGA implementation. The problem analysed may draw on other aspects of the tripos (bioinformatics, complexity, security, machine-learning/AI) exercising a broad range of skills in addition to architecture. The project could also emphasise evaluation, digging into how and why the implementations perform differently. The architecture group may be able to provide limited access to many-core CPU/GPU servers, as well as FPGA boards.

    Problems investigated in the past: Breaking of Lorenz cipher (17-18), Bioinformatics: sequence alignment (19-20).

    Ideas for future projects: Game AI (Chess? Go?) or Traveling salesman



  16. Timing Model for an ISA Simulation
    Contact: Marno van der Maas


    Instruction-level simulators are convenient to use, easy to understand and widely available, e.g. Spike for RISC-V. However, if performance figures are required, a timing model is needed. Rather than simulate the whole pipeline in detail, one approach is to add a timing model to an ISA simulator, which has the advantage that the timing model doesn't need to execute the instruction but instead simply deals with classes of instructions. For example, there need be only one simple model to figure out the timing of all R-type arithmetic/logical operations. Similarly, there will be one model for all branches, another for loads/stores, etc. The models could initially be simple, e.g. R-type instructions all take one cycle, but then refined, e.g. taking into account forwarding paths that might introduce load-to-use dependencies on R-type instructions. Similarly, branches might be modelled as always introducing one bubble into the pipeline, through to modelling branch predictors. Memory operations could be timed via a simple statistical hit-rate model, or you could model cache hierarchies in various levels of detail.
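
    A minimal Python sketch of how such a class-based timing model might bolt onto an ISA simulator's retired-instruction stream; the instruction classes, latencies, load-to-use penalty and statistical hit rate are all placeholder numbers chosen only to show the structure.

        import random

        # Per-class timing model driven by an ISA simulator's retired instructions.
        # Latencies and hit rates below are illustrative placeholders.
        LATENCY = {"alu": 1, "branch_not_taken": 1, "branch_taken": 2,   # 1 bubble when taken
                   "load_hit": 2, "load_miss": 30, "store": 1}
        L1_HIT_RATE = 0.95

        class TimingModel:
            def __init__(self):
                self.cycles = 0
                self.last_load_dest = None          # for a simple load-to-use penalty

            def retire(self, kind, srcs=(), dest=None, taken=False):
                if kind == "alu":
                    cost = LATENCY["alu"]
                    if self.last_load_dest in srcs:  # forwarding cannot hide load-to-use
                        cost += 1
                    self.last_load_dest = None
                elif kind == "branch":
                    cost = LATENCY["branch_taken"] if taken else LATENCY["branch_not_taken"]
                elif kind == "load":
                    hit = random.random() < L1_HIT_RATE      # statistical cache model
                    cost = LATENCY["load_hit"] if hit else LATENCY["load_miss"]
                    self.last_load_dest = dest
                else:
                    cost = LATENCY["store"]
                self.cycles += cost

        m = TimingModel()
        m.retire("load", dest="x5")
        m.retire("alu", srcs=("x5", "x6"), dest="x7")   # pays the load-to-use bubble
        m.retire("branch", taken=True)
        print(m.cycles, "cycles")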



  17. Automatic Generation of Synchronisation Primitives for Multithreaded Programs
    Contact: Timothy Jones
    Contact: Jasmin Jahic


    Reasoning about synchronisation mechanisms in multithreaded software is hard. This work aims to create an approach that would free developers from this challenge by generating optimal synchronisation mechanisms for the software under consideration. During this work, it will be necessary to create an approach that reasons about concurrency bugs, potential synchronisation mechanisms, and developer’s synchronisation intentions. It will be necessary to create a conceptual solution for this challenge, an algorithm for generation of synchronisation mechanisms, and a prototype of a tool that demonstrates the benefits of this approach.



  18. KVM-Accelerated Full-System Binary Translation
    Contact: Timothy Jones
    Contact: Gary Guo


    In order for binary-translating emulators to support full-system emulation, they need to translate virtual addresses to physical addresses on each memory access. QEMU handles this by creating a software direct-mapped TLB and generating code that checks this TLB for each memory-access instruction. This software-TLB lookup requires at least two extra memory accesses for each original memory-access instruction and therefore hurts performance.

    The Kernel-based Virtual Machine (KVM) allows userspace applications to run as VM hypervisors. This project will develop a system that lets DBT-ed code execute inside a guest VM, leveraging the CPU's second-level address translation to do the address translation for us. Instead of maintaining a software TLB, a shadow page table would be used, which should provide considerable performance improvements.



  19. RISC-V Implementation in Bluespec
    Contact: Simon Moore
    Contact: David Chisnall


    Note that since this project was suggested we've implemented a number of RISC-V cores in Bluespec, including ones extended with CHERI security features - see CHERI-RISC-V

    The University of California, Berkeley is developing the RISC-V open instruction set architecture to promote open-source research into computer architecture, but their current implementations are simple, unproven, user-mode designs. At the Cambridge Computer Laboratory, we have been developing the BERI 64-bit MIPS processor, which now has a mature design with register forwarding, branch prediction, an MMU, floating point and a dependable cache hierarchy, as well as a mature system-on-chip.

    This project would implement the RISC-V ISA instead of the 64-bit MIPS ISA using the BERI infrastructure. The base project would include user-mode, 32-bit instructions. Optional extensions, of which at least one should be attempted, include floating point instructions, 16-bit instructions, and full system support (which is preliminary in the specification). The resulting processor should be able to run code compiled with riscv-gcc from Berkeley. The student may also attempt to develop, or collaborate on, an LLVM backend for the RISC-V ISA. This project will explore the implementation implications of the experimental RISC-V instruction set as well as provide insight into the efficiency of the ISA when running compiled code.

    We have an online web tutor to allow people to quickly learn the Bluespec SystemVerilog hardware description language: http://www-bluespec.cl.cam.ac.uk/



  20. Real FPGA Virtual I/O
    Contact: Simon Moore


    This is a fairly advanced project. VirtIO is used to virtualize devices for virtual machines. It provides an abstraction layer between the guest OS and the virtual machine. Now that we have System-on-Chip (SoC) FPGAs, it may be possible to treat the ARM core (running Linux or FreeBSD) as the guest OS, with a NIOS core mimicking the virtual machine and shared memory (or a FIFO) between the two. The NIOS could then be replaced with some custom FPGA hardware to consume (or produce) VirtIO. For example, it would be good to have a VirtIO stream to Avalon Stream adaptor, or a VirtIO block device that could scan/search through blocks of data. Such an approach would allow FPGA acceleration while not having to deal with the low-level configuration details of a particular device.



  21. Convolutional neural networks on FPGA
    Contact: Jonathan Woodruff (jdw57@cl)


    Develop a convolutional neural network engine on FPGA. Convolutional neural networks are finding applications in big data learning but are mostly running on standard CPUs or GPUs. This project would design a hardware/software architecture for efficient processing of convolutional neural networks on FPGA using either the BlueVec vector processor or NiosII CPU cores with custom accelerators. BlueVec is an open-source vector processor written in Bluespec SystemVerilog for synthesis on FPGA; it has been shown to be very efficient for low-precision arithmetic such as that used in convolutional neural networks, and may be a useful starting point for this project.



  22. Using reinforcement learning to achieve backend optimisation for quantum compilers
    Contact: Steven Herbert
    Contact: Timothy Jones
    Contact: Robert Mullins


    With the advent of the first generation of quantum computers (for example Google, Rigetti and IBM in the US; and OQC, Universal Quantum and NQIT in the UK), attention has turned to compiling quantum algorithms to run on these machines. The emerging norm is a quantum compiler structure that is, in some important respects, closely analogous to that of a classical compiler. In particular, the front-end, optimiser, back-end division has been retained, in which the quantum circuit (typically encoded in openQASM) plays the role of the intermediate representation (IR).

    Just as in the classical case, the backend receives the IR, which is in some sense optimised, but in a manner that is not specific to the target hardware. Specifically, quantum circuits consist of two-qubit gates and implicitly assume complete connectivity - that is, any pair of qubits can interact (gate together). This, however, contravenes physical constraints, which mean that there will be some limitation on the actual connectivity; for example, the qubits may be laid out in a square grid, with each only able to interact with its four neighbours. It is possible to swap qubits, so general quantum algorithms can still be executed on such machines, but with an execution time cost incurred by the insertion of these swaps. Therefore the role of the quantum compiler backend is to optimally map the IR (which is a quantum circuit assuming complete connectivity) to a target architecture with some known, limited connectivity.

    Typically, ad-hoc and/or heuristic approaches have been taken to this quantum backend optimisation problem; however, we have previously shown that it can be expressed as a reinforcement learning problem, with proof-of-principle results demonstrating the potential of this technique. The goal of this project is therefore to improve and extend this work.
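
    To make the baseline concrete, the Python sketch below routes a single two-qubit gate on a hypothetical square-grid device by inserting SWAPs along a shortest (Manhattan) path until the operands are adjacent - the sort of naive heuristic a learned router would aim to beat across whole circuits.

        # Naive SWAP-insertion routing on an N x N grid: move qubit a toward qubit b
        # one hop at a time until they are adjacent, recording the swaps. A learned
        # router would instead choose swaps to minimise total cost across a circuit.
        N = 4                                   # 4x4 grid of physical qubits (assumed)

        def pos(q):          return divmod(q, N)            # physical index -> (row, col)
        def idx(r, c):       return r * N + c
        def adjacent(a, b):
            (ra, ca), (rb, cb) = pos(a), pos(b)
            return abs(ra - rb) + abs(ca - cb) == 1

        def route_two_qubit_gate(a, b):
            """Return the list of SWAPs (on physical qubits) needed before gate (a, b)."""
            swaps = []
            while not adjacent(a, b):
                (ra, ca), (rb, cb) = pos(a), pos(b)
                # Step qubit a one hop toward b, closing the row distance first.
                if ra != rb:
                    nxt = idx(ra + (1 if rb > ra else -1), ca)
                else:
                    nxt = idx(ra, ca + (1 if cb > ca else -1))
                swaps.append((a, nxt))
                a = nxt                  # after the SWAP the logical qubit lives at nxt
            return swaps

        print(route_two_qubit_gate(0, 15))   # corner to corner on the 4x4 grid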



  23. Use-After-Free Security Mitigations
    Contact: Timothy Jones
    Contact: Sam Ainsworth


    Use-after-free vulnerabilities, where a victim mistakenly reuses data that has been freed and reallocated by an attacker, have plagued software written in low-level languages, such as C and C++, becoming one of the most frequent classes of exploited software bugs. This project aims to mitigate this threat in software at low performance overhead, by extending a state-of-the-art memory allocator to perform efficient checks on programmer-freed memory. This will involve a variety of optimizations based on virtual memory and past system behaviour, to achieve better than state-of-the-art performance in a low-level C allocator.



  24. Binary Alias Analysis
    Contact: Timothy Jones


    One of the (many!) problems with transforming application binaries is that a significant amount of information about the code has been lost during the compilation stages. Analysing and transforming binaries is an important task because it allows us to optimise programs even when we don't have access to the source code. A number of transformations rely on knowing whether two instructions access the same locations in memory, so that we can ensure we maintain application correctness when we perform our optimisations. Recently, a number of analyses have been proposed to identify the memory objects that a program uses, specifically for Intel x86 binaries. The aim of this project is to implement a subset of these analyses for ARM binaries, and develop strategies for coping with facets of the ISA that are not present in x86, like predicated execution. If time permits, this analysis could then be used to implement a code motion transformation to safely move instructions around in the binary.



  25. Ultra-Fine-Grained Parallelisation
    Contact: Timothy Jones


    Automatic program parallelisation is the process of taking a sequential application and splitting it into threads that can be run independently. One of the challenges of automatic parallelisation is defining the amount of code that each thread should execute. This project seeks to take this to the extreme, by splitting up the application into very small units of work, with threads executing many of these tasks each. This could be performed by hand or, ideally, implemented as a pass within an optimising compiler, such as LLVM.



  26. JavaScript Parallelisation
    Contact: Timothy Jones


    This is an ambitious project to evaluate the potential for parallelisation of JavaScript applications. The main idea is to instrument the code generated by a JavaScript engine (e.g. Google's V8) and assess the parallelism available under different scenarios (e.g. across each iteration of loops). We have already developed a basic framework for analysing generic code, but it would need extending and adding to the compiler. To tackle this project, you'd need to be willing to learn about the internals of a dynamic compiler and be confident making changes to incorporate the instrumentation.



  27. Energy-Efficient Caching
    Contact: Timothy Jones


    Processor caches exploit both spatial and temporal locality to reduce the latency of accessing memory. In an ideal cache, when an item of data is brought in it would be free to occupy any position within the cache that it liked, replacing the least recently used value wherever it may be. In reality, fully-associative caches such as these are too costly to implement in hardware. At the other extreme, direct-mapped caches restrict each data item to only one position within the cache, but these suffer from a significant number of conflict misses, when two data items map to the same position. Therefore a compromise is found with some form of set-associativity, usually restricting each data item to a small number of positions.

    Wouldn't it be great to be able to combine the best of both worlds by having fully-associative caches at the cost of direct-mapped hardware? Recent research has pointed to a potential method for achieving this, but as yet nobody has applied the concept to caches. This project aims to be the first to do this. It will perform research into this type of cache hardware, evaluating the trade-offs involved and determining how best to make use of this recent research. The aim is to create a cache that has the low miss rate of a fully-associative cache, but with an energy consumption close to a direct-mapped cache.
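
    To ground the comparison, here is a small Python simulator (illustrative sizes only) that replays an address trace through a direct-mapped cache and a fully-associative LRU cache of the same capacity; the project would aim for roughly the associative miss rate at close to direct-mapped energy.

        from collections import OrderedDict

        LINE = 64                                   # bytes per cache line
        LINES = 64                                  # total lines (4 KiB of data either way)

        def direct_mapped_misses(trace):
            tags = [None] * LINES
            misses = 0
            for addr in trace:
                block = addr // LINE
                index, tag = block % LINES, block // LINES
                if tags[index] != tag:
                    misses += 1
                    tags[index] = tag
            return misses

        def fully_assoc_lru_misses(trace):
            cache = OrderedDict()                   # block -> None, kept in LRU order
            misses = 0
            for addr in trace:
                block = addr // LINE
                if block in cache:
                    cache.move_to_end(block)        # refresh LRU position
                else:
                    misses += 1
                    if len(cache) == LINES:
                        cache.popitem(last=False)   # evict least recently used
                    cache[block] = None
            return misses

        # Two arrays that conflict badly in a direct-mapped cache but fit together overall.
        trace = [base + i * LINE for i in range(32) for base in (0x0000, 0x1000)] * 4
        print("direct-mapped misses:", direct_mapped_misses(trace))
        print("fully-associative misses:", fully_assoc_lru_misses(trace))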



  28. Flexible I/O for the lowRISC SoC
    Contact: Robert Mullins


    The lowRISC project (www.lowrisc.org) is aiming to produce a competitive open-source SoC. One important part of this project is the design of an array of simple I/O coprocessors called "Minions". Soft peripheral interfaces can be created by programming these cores. They can also be used to off-load work from the SoC's main processors, e.g. by filtering or preprocessing I/O.

    This project will explore the architecture of the Minions and the thin layer of custom logic that will be placed between the I/O pins and the Minion cores themselves (the I/O shim). The aim will be to develop an implementation that is able to support the widest range of interface types at the lowest cost. This will involve carefully dividing work between the cores and I/O shim, investigating ISA extensions and devising a suitable interface between the minions and I/O shim.

    Slides outlining the lowRISC project can be found here



  29. Exploring Architectural Trade-offs in RISC-V Processors
    Contact: Robert Mullins


    This project will explore complexity, area, power and performance trade-offs for a number of different processor implementations (targeting the RISC-V ISA). Comparisons will be made to public implementations and those created by the student.

    Detailed comparisons will be made using a standard ASIC toolflow.



  30. A Flexible Multipurpose Tagged Memory System
    Contact: Robert Mullins


    The lowRISC project (www.lowrisc.org) is aiming to produce a competitive open-source SoC. We aim to support a simple tagged memory system to provide protection against control-flow hijack attacks. This project would explore the implementation of the tagged memory system and possible other uses for it, e.g. infinite memory watchpoints, garbage collection, accelerating existing debug tools, locks on every word, simple control-flow integrity checks etc.



  31. Optimal Heterogeneous CMP Core Selection
    Contact: Timothy Jones


    As we project into the future, continued increases in transistor counts, coupled with tight processor power constraints, will lead to increased specialisation of cores within a chip multiprocessor (CMP). However, it is still an open question as to what this heterogeneous CMP will look like.

    This project will seek to answer this question by exploring the design space of heterogeneous CMPs. It will use the gem5 simulation infrastructure to run applications on a variety of cores and develop an algorithm to pick the best ones, given constraints such as power or area.



  32. Vectorisation in General Purpose Applications
    Contact: Timothy Jones


    Modern application processors now contain specialised instructions for operating on a vector of data. This is often called single-instruction, multiple data (SIMD) processing, and common forms are the SSE and AVX instructions in x86 processors, or NEON instructions in ARM. Making use of these instructions can provide significant speed ups.

    This project will study the opportunities for vectorisation within general purpose applications, which are traditionally not suited to this kind of processing. It will analyse the loops within each application to determine the inherent vector operations and those that can be exposed through additional compiler transformations. The goal is to expose as many opportunities for vectorisation as possible and, if time allows, implement a vectorisation pass within a compiler to take advantage of these.



  33. RISC-V Implementation in Bluespec
    Contact: Simon Moore
    Contact: David Chisnall


    The University of California, Berkeley is developing the RISC-V open instruction set architecture to promote open-source research into computer architecture, but their current implementations are simple, unproven, user-mode designs. At the Cambridge Computer Laboratory, we have been developing the BERI 64-bit MIPS processor, which now has a mature design with register forwarding, branch prediction, an MMU, floating point and a dependable cache hierarchy, as well as a mature system-on-chip.

    This project would implement the RISC-V ISA instead of the 64-bit MIPS ISA using the BERI infrastructure. The base project would include user-mode, 32-bit instructions. Optional extensions, of which at least one should be attempted, include floating point instructions, 16-bit instructions, and full system support (which is preliminary in the specification). The resulting processor should be able to run code compiled with riscv-gcc from Berkeley. The student may also attempt to develop, or collaborate on, an LLVM backend for the RISC-V ISA. This project will explore the implementation implications of the experimental RISC-V instruction set as well as provide insight into the efficiency of the ISA when running compiled code.



  34. A Fast Cache Hierarchy for BERI
    Contact: Simon Moore
    Contact: Jonathan Woodruff


    The BERI processor, developed at the University of Cambridge, is a 64-bit MIPS processor which is somewhat mature and reliable, but has not so far been optimised extensively for performance. One of the greatest shortcomings of the current design is cache performance, which only allows a single outstanding transaction.

    This project would implement a cache hierarchy for the BERI project that can saturate the bandwidth to DRAM on the Terasic DE4. The student would implement instruction and data L1 caches with a shared L2 cache, as well as a traffic generator to test the hierarchy. The caches should be pipelined, allow at least 16 outstanding transactions and run at a high clock speed on the Terasic DE4. The caches should be parameterisable for size and possibly for line size and associativity. The traffic generator should be capable of both speed tests and complex patterns to test consistency in the caches. An optional extension would be to extend the caches to support coherency when more than one set of L1 caches is present. The final report should present cache performance with a range of parameters which trade off area, clock speed, and performance.



  35. Application Scheduling for Heterogeneous Systems
    Contact: Robert Mullins and Timothy Jones


    As energy efficiency becomes the main driver for processor development, heterogeneous systems become attractive, since they allow applications to be scheduled on the cores that best suit their current requirements. Emerging heterogeneous systems include those with close CPU-GPU integration and ARM's big.LITTLE processors.

    The goal of this project is to perform an evaluation of a heterogeneous multicore system using the gem5 simulation environment. It will consider a range of cores to determine the optimal system for a group of multi-threaded and multi-programmed workloads. There should not need to be a significant amount of infrastructure development, since gem5 already includes support for multiple, configurable cores. The results will be an analysis of the types of workloads that benefit from heterogeneity in the processor and how they can be successfully scheduled together.



  36. Speculative Guided Parallelisation of Application Binaries
    Contact: Robert Mullins and Timothy Jones


    With multicore systems now the norm across the computing landscape, and many-core systems on the horizon, it is important for applications to gain performance through parallel execution. However, a significant fraction of existing software is in single-threaded form, and rewriting it to be parallel would be a significant undertaking.

    This project seeks to parallelise applications without needing to alter the program source code. Using dynamic binary instrumentation and rewriting, such as within DynamoRIO, it will alter program loops as they execute to allow them to run in parallel. To avoid complicated analysis of each loop, it will employ a form of speculation to catch situations where the code must be executed sequentially. The loops to parallelise will be determined in advance.



  37. Acceleration of the Floyd-Steinberg dithering algorithm
    Contact: Robert Mullins and Timothy Jones


    Applications such as high-speed ink-jet printing need to perform image dithering at Gpixel/s rates. Highly optimised sequential implementations can today only reach ~200Mpixels/sec. This project will explore parallel implementations of the Floyd-Steinberg algorithm, either hand-coded or produced with the aid of an automatic loop parallelisation technique (called HELIX).

    There is scope to extend the project to explore source-to-source transformations that could improve the performance of the HELIX technique.
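
    For reference, a straightforward sequential Floyd-Steinberg implementation looks like the Python/NumPy sketch below (greyscale, 1-bit output); the serial dependence of each pixel on its left and upper neighbours is precisely what makes parallelisation interesting.

        import numpy as np

        def floyd_steinberg(img: np.ndarray) -> np.ndarray:
            """Sequential 1-bit Floyd-Steinberg dithering of a greyscale image in [0, 255]."""
            work = img.astype(np.float32).copy()
            h, w = work.shape
            out = np.zeros((h, w), dtype=np.uint8)
            for y in range(h):
                for x in range(w):
                    old = work[y, x]
                    new = 255.0 if old >= 128 else 0.0
                    out[y, x] = int(new)
                    err = old - new
                    # Diffuse the quantisation error to not-yet-visited neighbours.
                    if x + 1 < w:
                        work[y, x + 1] += err * 7 / 16
                    if y + 1 < h:
                        if x > 0:
                            work[y + 1, x - 1] += err * 3 / 16
                        work[y + 1, x] += err * 5 / 16
                        if x + 1 < w:
                            work[y + 1, x + 1] += err * 1 / 16
            return out

        gradient = np.tile(np.linspace(0, 255, 64), (64, 1))
        print(floyd_steinberg(gradient)[:2, :8])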

    [1] P. T. Metaxas, "Optimal parallel error diffusion dithering"
    [2] Y. Zhang, "Line diffusion: a parallel error diffusion algorithm for digital halftoning"

  38. Scalable Graphics Shader Engine
    Contact: Simon Moore


    Full-system research is becoming practical using FPGAs, and Cambridge is at the forefront with a full CPU and OS stack and a number of peripherals. However, modern systems are not complete without an autonomous graphics processing unit, which has implications for system-on-chip data flow and prioritization, memory allocation and scheduling, and especially security.

    This project would explore the implications of an autonomous graphics processing unit in a system-on-chip architecture. This project would design and build a compact, scalable fragment shader engine in Bluespec SystemVerilog which is able, at least, to apply textures to triangles in a framebuffer. We would recommend an internal 16-bit floating-point pixel format similar to the ARM MALI GPU to save area and improve timing.

    Evaluation could include efficiency and performance as well as novel memory protection or sharing ideas when combined with the Cambridge CHERI 64-bit MIPS processor optionally running FreeBSD.
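
    As a software reference for the per-fragment work the engine must do, the sketch below rasterises one triangle using barycentric coordinates and samples a texture at the interpolated UVs (nearest-neighbour, no perspective correction); the FPGA engine would implement the same per-fragment arithmetic, e.g. in 16-bit floating point.

        import numpy as np

        def edge(a, b, p):
            """Signed area of (a, b, p); the sign says which side of edge ab the point p is on."""
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        def draw_textured_triangle(fb, tex, verts, uvs):
            """Rasterise one triangle into framebuffer fb, sampling tex at interpolated UVs."""
            area = edge(verts[0], verts[1], verts[2])
            th, tw = tex.shape[:2]
            for y in range(fb.shape[0]):
                for x in range(fb.shape[1]):
                    p = (x + 0.5, y + 0.5)
                    w0 = edge(verts[1], verts[2], p)
                    w1 = edge(verts[2], verts[0], p)
                    w2 = edge(verts[0], verts[1], p)
                    if (w0 >= 0) == (area > 0) and (w1 >= 0) == (area > 0) and (w2 >= 0) == (area > 0):
                        b0, b1, b2 = w0 / area, w1 / area, w2 / area     # barycentric weights
                        u = b0 * uvs[0][0] + b1 * uvs[1][0] + b2 * uvs[2][0]
                        v = b0 * uvs[0][1] + b1 * uvs[1][1] + b2 * uvs[2][1]
                        tx = min(int(u * tw), tw - 1)                    # nearest-neighbour sample
                        ty = min(int(v * th), th - 1)
                        fb[y, x] = tex[ty, tx]

        fb = np.zeros((32, 32), dtype=np.uint8)
        tex = np.indices((8, 8)).sum(axis=0).astype(np.uint8) % 2 * 255   # checkerboard texture
        draw_textured_triangle(fb, tex, [(2, 2), (30, 4), (8, 28)], [(0, 0), (1, 0), (0, 1)])
        print(fb[:6, :12])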

    [1] ARM, MALI Shader Arithmetic